Kyubyong/dc_tts

A TTS model that learns from Nick Offerman and Kate Winslet

Kyubyong's DC-TTS implementation tests whether a convolution-based speech synthesizer can learn from small, idiosyncratic datasets, not just standard benchmarks.

1.2k stars · Python · Image · Video · Audio
Velocity (7d): +0.4 ★/day · Trend: steady

What it does

This is a TensorFlow implementation of DC-TTS, a text-to-speech system built entirely on deep convolutional networks with guided attention. It works in stages: Text2Mel converts text to mel spectrograms, SSRN (the spectrogram super-resolution network) turns those into linear spectrograms, and the result is finally converted to audio. The repo includes training pipelines, synthesis scripts, and a pretrained model for the LJ Speech dataset.
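
To make the staged pipeline concrete, here is a minimal shape-level sketch of what each stage hands to the next. It is not the repo's code. The helper name `pipeline_shapes` and the defaults (80 mel bins, a time-reduction factor of 4, an FFT size of 2048) are assumptions chosen for illustration; check the repo's hyperparameter file for the actual values.

```python
from dataclasses import dataclass


@dataclass
class Shapes:
    text: tuple        # (n_chars,) character ids fed to Text2Mel
    coarse_mel: tuple  # (frames / r, n_mels) produced by Text2Mel
    linear: tuple      # (frames, 1 + n_fft // 2) produced by SSRN


def pipeline_shapes(n_chars, n_frames, n_mels=80, n_fft=2048, r=4):
    """Hypothetical helper: array shapes at each stage of the pipeline.

    n_mels, n_fft and r are illustrative assumptions, not values
    confirmed by this page.
    """
    coarse_frames = -(-n_frames // r)  # ceiling division
    return Shapes(
        text=(n_chars,),
        coarse_mel=(coarse_frames, n_mels),
        # SSRN restores the time resolution and widens the frequency axis.
        linear=(coarse_frames * r, 1 + n_fft // 2),
    )


shapes = pipeline_shapes(n_chars=50, n_frames=800)
print(shapes)
```

Under these assumptions Text2Mel only has to predict a short, coarse spectrogram, while SSRN handles the upsampling, which is one way to read why the two networks are separate stages.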

The interesting bit

The author didn’t just replicate the paper; he stress-tested it. Nick Offerman’s 18 hours of audiobooks and Kate Winslet’s 5-hour recording join the standard LJ Speech benchmark, plus a Korean dataset. The goal: see if the model learns when data is scarce and voices are, shall we say, characterful. He also had to deviate from the paper by adding layer normalization, decaying the learning rate, and applying dropout, all points on which the original authors stayed silent.

Key highlights

  • Purely convolutional architecture with no recurrent layers, which the author says makes it faster than Tacotron
  • Guided attention mechanism that reportedly locks alignment early (monotonic attention plots “almost from the beginning”)
  • Trained on four distinct datasets: LJ Speech (24h), Nick Offerman (18h), Kate Winslet (5h), and Korean KSS (12h)
  • Generated samples at multiple training steps posted to SoundCloud for direct comparison
  • Pretrained LJ model available via Dropbox; Harvard Sentences included for quick synthesis tests
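
The guided attention idea from the highlights can be sketched in a few lines. The snippet below assumes the penalty form described in the DC-TTS paper, W[n][t] = 1 - exp(-(n/N - t/T)² / (2g²)) with g = 0.2; it is not taken from this repo's code, and the function names are hypothetical. Attention mass far from the text/audio diagonal is penalized, which nudges the model toward monotonic alignment early in training.

```python
import math


def guided_attention_weights(n_chars, n_frames, g=0.2):
    """Penalty matrix: near 0 on the diagonal, approaching 1 far from it."""
    return [
        [
            1.0 - math.exp(-((n / n_chars - t / n_frames) ** 2) / (2.0 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_chars)
    ]


def guided_attention_penalty(attention, g=0.2):
    """Mean of attention * penalty over all (character, frame) pairs."""
    n_chars, n_frames = len(attention), len(attention[0])
    weights = guided_attention_weights(n_chars, n_frames, g)
    total = sum(
        a * w
        for a_row, w_row in zip(attention, weights)
        for a, w in zip(a_row, w_row)
    )
    return total / (n_chars * n_frames)


N, T = 20, 60  # 20 characters, 60 spectrogram frames (toy sizes)

# Each frame attends to exactly one character, moving left to right.
diagonal = [[1.0 if n == t * N // T else 0.0 for t in range(T)] for n in range(N)]
# Each frame spreads its attention evenly over every character.
uniform = [[1.0 / N] * T for _ in range(N)]

diag_penalty = guided_attention_penalty(diagonal)
uniform_penalty = guided_attention_penalty(uniform)
print(diag_penalty < uniform_penalty)
```

The diagonal alignment incurs a much smaller penalty than the uniform one, which is the pressure the loss term applies during training.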

Caveats

  • Requires TensorFlow ≥ 1.3, and the tf.contrib.layers.layer_norm API has changed since 1.3; tf.contrib was removed entirely in TensorFlow 2.x, so expect to pin an old 1.x release
  • The author couldn’t replicate the paper’s “trained within a day” claim, so don’t count on the paper’s training-time figure
  • Training Text2Mel and SSRN simultaneously failed for the author, so the two networks have to be trained separately

Verdict

Worth a look if you’re studying TTS architectures or need a convolutional baseline to compare against newer transformer-based models. Skip it if you want production-ready, maintained code: this is a research sandbox from the TensorFlow 1.x era, and the author treats it as such.
