Merlin: Edinburgh's recipe-driven TTS toolkit from the deep-learning before-times
A 2016-era neural speech synthesis system that still expects you to bring your own vocoder and text processor.

What it does
Merlin is a Python toolkit for building deep neural network models for statistical parametric speech synthesis — think text-to-speech where you train models to generate acoustic features rather than raw audio. It ships with “recipes” in the style of the Kaldi speech recognition toolkit, walking you through building complete voice systems from data.
The interesting bit
The project sits at an archaeological layer of deep learning: it is built on Theano, optionally supports TensorFlow and Keras, and explicitly requires you to bolt on external tools for the jobs it won’t do. You bring the front-end text processor (such as Festival) and the vocoder (such as STRAIGHT or WORLD); Merlin handles the neural net middle. This modularity was principled design in 2016; today it reads as deliberate complexity.
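To make the three-stage split concrete, here is a minimal Python sketch of the pipeline shape described above: a front end turns text into linguistic features, an acoustic model maps those to acoustic features, and a vocoder turns acoustic features into samples. Every name in it (`toy_front_end`, `toy_acoustic_model`, `toy_vocoder`, `synthesize`) is hypothetical and the stages are toy stand-ins. It does not use or imitate Merlin's, Festival's, or WORLD's actual APIs.

```python
from typing import Callable, List

Frames = List[List[float]]


def toy_front_end(text: str) -> Frames:
    """Stand-in for a text processor (the role Festival plays).

    Emits one 2-dimensional "linguistic" feature vector per letter.
    """
    return [
        [(ord(ch) % 32) / 32.0, 1.0 if ch in "aeiou" else 0.0]
        for ch in text.lower()
        if ch.isalpha()
    ]


def toy_acoustic_model(linguistic: Frames) -> Frames:
    """Stand-in for the neural network stage (the part Merlin covers).

    A fixed linear map from 2 linguistic to 3 "acoustic" features;
    a real system would use trained network weights instead.
    """
    weights = [[0.5, 0.1], [0.2, 0.7], [0.0, 1.0]]
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in linguistic
    ]


def toy_vocoder(acoustic: Frames, samples_per_frame: int = 4) -> List[float]:
    """Stand-in for a vocoder (the role STRAIGHT or WORLD plays).

    Expands each acoustic frame into a few constant-valued samples.
    """
    waveform: List[float] = []
    for frame in acoustic:
        level = sum(frame) / len(frame)
        waveform.extend([level] * samples_per_frame)
    return waveform


def synthesize(
    text: str,
    front_end: Callable[[str], Frames],
    acoustic_model: Callable[[Frames], Frames],
    vocoder: Callable[[Frames], List[float]],
) -> List[float]:
    """Chain the three stages; each one can be swapped independently."""
    return vocoder(acoustic_model(front_end(text)))


waveform = synthesize("Hello", toy_front_end, toy_acoustic_model, toy_vocoder)
```

Passing the stages in as plain callables mirrors the modularity the toolkit assumes: replacing the vocoder or the front end means swapping one argument, at the cost of having to source and wire up each piece yourself.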
Key highlights
- Born at the University of Edinburgh’s Centre for Speech Technology Research (CSTR)
- Apache 2.0 licensed, with explicit commercial-use permission
- Python 2.7–3.6 compatible (a tell in itself)
- Includes working demo voice (egs/slt_arctic) with published audio samples
- Extensive tutorial ecosystem: Interspeech 2017 course, blog posts, documentation site
Caveats
- Core dependency is Theano, whose developers announced the end of active development in 2017
- Requires UNIX; Windows support is not mentioned
Verdict
Worth studying if you’re researching the evolution of TTS architectures or need to reproduce historical speech synthesis work. Skip it if you want a modern, batteries-included text-to-speech pipeline — the field has moved to end-to-end neural models that don’t ask you to curate vocoders.