soobinseo/Transformer-TTS

TTS on a Transformer budget: faster training, same Griffin-Lim caveats

A PyTorch reimplementation that swaps RNNs for self-attention in speech synthesis, gaining training speed while keeping the usual vocoder compromises.

691 stars Python Image · Video · Audio

What it does

This is a PyTorch implementation of the 2018 paper “Neural Speech Synthesis with Transformer Network.” It replaces the recurrent seq2seq backbone of Tacotron-style TTS with a Transformer encoder-decoder, using multi-head attention to map text to mel spectrograms. A CBHG-based postnet (borrowed from Tacotron) refines the output, and Griffin-Lim reconstructs the waveform; no WaveNet vocoder is involved.
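
To make the attention step concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the operation each attention head applies. It is an illustration under stated assumptions, not the repository's code: the names `softmax` and `scaled_dot_product_attention` and the toy inputs are hypothetical, and a real model would use batched PyTorch tensors with learned projections per head.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-width vectors.

    Hypothetical helper: each query attends over all keys, and the
    resulting weights mix the value vectors.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example: two decoder queries attending over three encoder positions.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

Each row of `weights` is a probability distribution over encoder positions for one decoder step. Rows like these, stacked over all decoder steps, form the kind of matrix an attention plot displays.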

The interesting bit

The author reports training roughly 3–4× faster than Tacotron, at about 0.5 seconds per step. The attention plots reveal something the paper doesn’t fully prepare you for: diagonal alignment emerges in only a subset of the multi-head attention plots, and only after substantial training (around 15k steps); decoder self-attention in particular stays messy. The scaled positional encoding also diverges from the paper's values: the encoder alpha decays rather than rising to 4.
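
For readers unfamiliar with the alpha in question, the sketch below shows what a scaled positional encoding computes: a sinusoidal position vector multiplied by a scalar alpha and added to the input embedding. This is a pure-Python illustration, not the repository's implementation. The function names are hypothetical, and alpha is passed in as a plain number here, whereas in a trained model it would be a learnable parameter.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    encoding = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding


def add_scaled_position(embedding, position, alpha):
    """Return embedding + alpha * PE(position), element by element."""
    pe = positional_encoding(position, len(embedding))
    return [x + alpha * p for x, p in zip(embedding, pe)]


# Toy 4-dimensional embedding at position 0.
embedding = [0.5, 0.5, 0.5, 0.5]
with_position = add_scaled_position(embedding, position=0, alpha=1.0)
without_position = add_scaled_position(embedding, position=0, alpha=0.0)
```

A larger alpha makes position dominate the embedding and a smaller one mutes it, which is why the direction alpha drifts during training is worth watching.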

Key highlights

  • Trains on LJSpeech (13,100 text-audio pairs) with a two-stage pipeline: autoregressive Transformer, then separate postnet training
  • Pretrained models available (160k steps for AR model, 100k for postnet)
  • Includes attention visualization for all 12 attention heads in total, spread across 3 layers
  • Noam-style warmup/decay learning rate scheduling
  • Straightforward file layout: hyperparams.py for configuration, separate scripts for data prep, training, and synthesis
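
The Noam-style schedule mentioned in the list above can be sketched in a few lines: the learning rate climbs linearly during warmup, peaks at the last warmup step, then decays with the inverse square root of the step count. The function name and the default values (`peak_lr=1e-3`, `warmup_steps=4000`) are illustrative assumptions, not settings taken from the repository.

```python
def noam_learning_rate(step, peak_lr=1e-3, warmup_steps=4000):
    """Linear warmup to peak_lr at warmup_steps, then inverse-sqrt decay.

    Hypothetical helper; peak_lr and warmup_steps are example values.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # During warmup the first term is smaller; afterwards the second is.
    scale = min(step * warmup_steps ** -1.5, step ** -0.5)
    return peak_lr * warmup_steps ** 0.5 * scale


# Learning rate during warmup, at the peak, and during decay.
lr_warmup = noam_learning_rate(1000)
lr_peak = noam_learning_rate(4000)
lr_decay = noam_learning_rate(16000)
```

With these example values, the rate at step 1000 is a quarter of the peak, and by step 16000 it has fallen back to half of it.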

Caveats

  • Hard dependency on PyTorch 0.4.0, which is now ancient
  • Generated samples at 160k steps are explicitly noted as “not converged yet”; long sentences suffer
  • The stop token loss had to be abandoned because it broke training entirely
  • Gradient clipping (norm=1) and learning rate tuning are described as critical and finicky
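
As a reminder of what clipping to norm 1 does, here is a minimal pure-Python sketch of clipping by global L2 norm: if the combined norm of all gradients exceeds the limit, every gradient is rescaled by the same factor. The function name and the toy gradients are hypothetical. In PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`, which this sketch only approximates.

```python
import math


def clip_by_global_norm(gradients, max_norm=1.0):
    """Rescale gradient vectors so their joint L2 norm is <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if total_norm <= max_norm:
        # Already within the limit: return copies unchanged.
        return [list(grad) for grad in gradients], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in gradients], total_norm


# Two toy parameter gradients with a joint norm of 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Because all gradients share one scale factor, clipping changes the size of the update but not its direction.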

Verdict

Worth a look if you’re studying how Transformers behave in autoregressive audio generation, or need a faster-training baseline to iterate on. Skip it if you want production-ready TTS; the Griffin-Lim output, unconverged samples, and PyTorch 0.4.0 dependency make this a research reference rather than a shipping model.
