Speech synthesis that skips the queue
A non-autoregressive Transformer TTS implementation that generates spectrograms in one forward pass instead of token-by-token.

What it does
TransformerTTS generates mel spectrograms from text using a non-autoregressive Transformer, then hands off to a separate vocoder (MelGAN, HiFiGAN, or Griffin-Lim) to produce actual audio. It is built in TensorFlow 2 and ships with a pre-trained LJSpeech model you can run from a one-liner CLI or a Python script.
The interesting bit
The project ditches autoregressive generation entirely: no token-by-token mel decoding, and none of the attention collapse that autoregressive decoders can suffer on long sentences. Instead, a dedicated “Aligner” model extracts durations, and the forward model then predicts the whole spectrogram in parallel. Pitch and speed are exposed as controllable parameters, which is the kind of affordance you usually give up in exchange for fast inference.
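To make the duration idea concrete, here is a minimal sketch of how known durations allow parallel generation and speed control. It is not the project's code: the function names `scale_durations` and `expand_by_duration` are hypothetical, plain Python lists stand in for tensors, and the rounding rule is an assumption. Pitch control is not covered here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def scale_durations(durations: Sequence[int], speed: float = 1.0) -> List[int]:
    """Rescale per-token frame counts (hypothetical helper).

    speed > 1 shortens the utterance, speed < 1 stretches it.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    scaled = []
    for d in durations:
        if d == 0:
            scaled.append(0)  # a token with no frames stays silent
        else:
            # Round half up, but keep at least one frame per voiced token.
            scaled.append(max(1, int(d / speed + 0.5)))
    return scaled


def expand_by_duration(encodings: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each token encoding for as many frames as its duration.

    Once the output length is fixed this way, every frame can be
    predicted at the same time instead of one after another.
    """
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames: List[T] = []
    for vec, d in zip(encodings, durations):
        frames.extend([vec] * d)
    return frames


tokens = ["h", "i", "!"]   # stand-ins for encoder outputs
durations = [2, 1, 3]      # frames per token, as an aligner might supply

print(expand_by_duration(tokens, durations))
# ['h', 'h', 'i', '!', '!', '!']
print(expand_by_duration(tokens, scale_durations(durations, speed=2.0)))
# ['h', 'i', '!', '!']
```

In a TensorFlow 2 model the same expansion would typically be done on tensors, for example with `tf.repeat`, rather than on Python lists.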
Key highlights
- One-shot inference: the forward model generates the full spectrogram in a single pass
- Pre-trained LJSpeech model with weights at 5K-step intervals from 60K to 100K
- Compatible with MelGAN and HiFiGAN vocoders; older WaveRNN support was dropped in late 2020
- Duration extraction uses Dijkstra’s algorithm, which is either elegant or overkill depending on your worldview
- Includes a Colab notebook for trying synthesis without installing anything
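For readers curious about the Dijkstra bullet above, the following is a rough sketch of how a shortest-path search can turn an attention matrix into durations. It rests on assumptions that may not match the repository: the graph layout (each step moves to the next mel frame and either stays on the same token or advances by one), the cost (negative log of the attention weight), and the function name `extract_durations` are all illustrative.

```python
import heapq
import math
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (mel frame index, text token index)


def extract_durations(attention: List[List[float]]) -> List[int]:
    """attention[t][n] is the weight mel frame t places on text token n."""
    n_frames, n_tokens = len(attention), len(attention[0])

    def cost(t: int, n: int) -> float:
        # Clamp into [1e-8, 1] so every cost is finite and non-negative,
        # which Dijkstra's algorithm requires.
        return -math.log(min(1.0, max(attention[t][n], 1e-8)))

    start, goal = (0, 0), (n_frames - 1, n_tokens - 1)
    dist: Dict[Node, float] = {start: cost(0, 0)}
    prev: Dict[Node, Node] = {}
    heap = [(dist[start], start)]

    while heap:
        d, (t, n) = heapq.heappop(heap)
        if (t, n) == goal:
            break
        if d > dist[(t, n)] or t + 1 >= n_frames:
            continue  # stale heap entry, or no further frames to visit
        for next_n in (n, n + 1):  # stay on the token or move to the next one
            if next_n >= n_tokens:
                continue
            nxt = (t + 1, next_n)
            new_d = d + cost(t + 1, next_n)
            if new_d < dist.get(nxt, math.inf):
                dist[nxt] = new_d
                prev[nxt] = (t, n)
                heapq.heappush(heap, (new_d, nxt))

    if goal not in dist:
        raise ValueError("need at least as many frames as tokens")

    # Walk the path backwards, counting how many frames landed on each token.
    durations = [0] * n_tokens
    node = goal
    while True:
        durations[node[1]] += 1
        if node == start:
            break
        node = prev[node]
    return durations


attention = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
]
print(extract_durations(attention))  # [2, 1, 2]
```

Because this particular graph only ever moves forward in time, a plain dynamic program would find the same path. Dijkstra becomes the more natural choice when the graph allows other kinds of moves.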
Caveats
- The pre-trained API requires checking out a specific commit (493be634...); drift from that and things may break
- Training is a two-stage pipeline (Aligner → duration extraction → TTS), so “quick fine-tuning” is not really in the cards
- Only LJSpeech pre-trained weights are provided; other voices mean training from scratch
Verdict
Worth a look if you need controllable, parallel TTS and can live within the LJSpeech voice or train your own. If you want plug-and-play multilingual voices or a single-stage training loop, this is not your repo.