2noise/ChatTTS

A TTS model that laughs at your jokes — literally

ChatTTS generates conversational speech with baked-in prosodic controls like laughter, pauses, and interjections for LLM assistants.

39.4k stars · Python · Image · Video · Audio

What it does

ChatTTS is a text-to-speech model built for dialogue, not audiobooks. It turns text into natural-sounding speech with support for multiple speakers, and it can inject conversational tics like laughter, pauses, and breaks via special tokens you embed directly in the text. The project ships as a PyPI package with a WebUI, command-line tool, and Python API.

The interesting bit

The authors intentionally degraded their open-source model, adding high-frequency noise and heavy MP3 compression, to make it harder to misuse for deepfakes. They also trained an internal detection model that they plan to release. It’s a rare case of a project sabotaging its own output quality for safety reasons.

Key highlights

  • Trained on 100,000+ hours of Chinese and English audio; the open-source release is a 40,000-hour pretrained base without supervised fine-tuning
  • Token-level prosody control: [laugh], [uv_break], [lbreak], plus sentence-level prompts like oral_2 and break_6
  • ~4GB VRAM minimum; RTF around 0.3 on an RTX 4090
  • Supports streaming generation and zero-shot speaker inference via the DVAE encoder
  • Code is AGPLv3+; model weights are CC BY-NC 4.0 (academic/research only)
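
The token-level controls above are plain strings, so they can be prepared and checked before any model is loaded. The following is a minimal sketch in plain Python that does not call ChatTTS itself. The helper names (`sentence_prompt`, `unknown_tokens`) are hypothetical, and writing the sentence-level prompts in the same bracketed form as the word-level tokens is an assumption worth checking against the project's README.

```python
import re

# Word-level tokens named in the highlights above.
WORD_LEVEL_TOKENS = {"[laugh]", "[uv_break]", "[lbreak]"}


def sentence_prompt(oral: int, brk: int) -> str:
    """Build a sentence-level prompt string such as '[oral_2][break_6]'.

    Assumption: the prompts use the same bracketed form as word-level tokens.
    Valid level ranges are not stated in this summary, so only negative
    values are rejected here.
    """
    if oral < 0 or brk < 0:
        raise ValueError("levels must be non-negative")
    return f"[oral_{oral}][break_{brk}]"


def unknown_tokens(text: str) -> list[str]:
    """Return bracketed tokens in `text` that are not known word-level tokens."""
    found = re.findall(r"\[[^\[\]]+\]", text)
    return [token for token in found if token not in WORD_LEVEL_TOKENS]


# Word-level tokens are embedded directly in the text to be spoken.
text = "So I told the compiler a joke [uv_break] and it threw an exception [laugh]"
prompt = sentence_prompt(2, 6)

# Both strings would then be handed to the package's inference call;
# the exact API is not shown here, so consult the README for it.
```

A check like `unknown_tokens(text)` catches typos such as `[laught]` before they are sent to the model, where an unrecognized token might otherwise be read aloud or silently dropped.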

Caveats

  • Autoregressive instability: the FAQ admits that multi-speaker drift and quality issues are “generally difficult to avoid”, so you may need to sample several times and keep the best result
  • English support is marked experimental; the model is primarily optimized for Chinese
  • Two “unrecommended” optional dependencies are explicitly flagged: TransformerEngine does not currently run properly, and FlashAttention-2 slows generation down
  • No open-source emotion control yet; only laughter and breaks are available
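
The "sample multiple times" workaround from the first caveat amounts to best-of-N selection: generate several candidates with different seeds and keep the one that scores highest. The sketch below shows only that selection loop. `best_of_n`, `fake_synthesize`, and `fake_score` are hypothetical names, the stand-in functions produce random numbers instead of audio, and how to score real output (by ear, or with some automatic quality metric) is left open.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(
    synthesize: Callable[[int], T],
    score: Callable[[T], float],
    seeds: Sequence[int],
) -> T:
    """Run `synthesize` once per seed and return the highest-scoring result."""
    if not seeds:
        raise ValueError("need at least one seed")
    candidates = [synthesize(seed) for seed in seeds]
    return max(candidates, key=score)


def fake_synthesize(seed: int) -> list[float]:
    """Stand-in for a TTS call: returns 8 pseudo-random samples for a seed."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(8)]


def fake_score(samples: list[float]) -> float:
    """Stand-in quality score: prefers a lower peak amplitude (illustrative only)."""
    return -max(abs(sample) for sample in samples)


best = best_of_n(fake_synthesize, fake_score, seeds=[0, 1, 2, 3])
```

In real use, `synthesize` would wrap the model's inference call with a fixed seed per attempt, so that a good result can be reproduced later.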

Verdict

Worth a look if you’re building Chinese-first voice agents or researching controllable prosody in TTS. Skip it if you need production-ready English voices or commercial licensing: the model weights are strictly non-commercial.
