A TTS model that laughs at your jokes — literally
ChatTTS generates conversational speech with baked-in prosodic controls like laughter, pauses, and interjections for LLM assistants.

What it does
ChatTTS is a text-to-speech model built for dialogue, not audiobooks. It turns text into natural-sounding speech with support for multiple speakers, and it can inject conversational tics like laughter, pauses, and breaks via special tokens you embed directly in the text. The project ships as a PyPI package with a WebUI, command-line tool, and Python API.
The interesting bit
The authors intentionally degraded their open-source model, adding high-frequency noise and heavy MP3 compression, to make it harder to misuse for deepfakes. They also trained an internal detection model they plan to release. It’s a rare case of a project sabotaging its own output quality for safety reasons.
Key highlights
- Trained on 100,000+ hours of Chinese and English audio; the open-source release is a 40,000-hour pretrained base without supervised fine-tuning
- Token-level prosody control: [laugh], [uv_break], [lbreak], plus sentence-level prompts like oral_2 and break_6
- ~4GB VRAM minimum; RTF around 0.3 on an RTX 4090
- Supports streaming generation and zero-shot speaker inference via DVAE encoder
- Code is AGPLv3+; model weights are CC BY-NC 4.0 (academic/research only)
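To make the prosody bullet concrete, here is a minimal sketch of how you might compose and sanity-check tagged input before handing it to the model. It does not call ChatTTS at all. The helper names (find_unknown_tokens, build_prompt) are hypothetical, and the assumption that sentence-level tags take a single digit is mine, inferred only from the oral_2 and break_6 examples above.

```python
import re

# Token-level controls named in this review; treated here as the full allowed set.
TOKEN_CONTROLS = {"laugh", "uv_break", "lbreak"}

# Sentence-level tags such as oral_2 or break_6: a name, an underscore, one digit.
# The single-digit range is an assumption of this sketch, not a documented limit.
PROMPT_TAG = re.compile(r"^(oral|break)_[0-9]$")


def find_unknown_tokens(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not known token-level controls."""
    tags = re.findall(r"\[([^\[\]]+)\]", text)
    return [tag for tag in tags if tag not in TOKEN_CONTROLS]


def build_prompt(*tags: str) -> str:
    """Join sentence-level tags like 'oral_2' into a string like '[oral_2][break_6]'."""
    for tag in tags:
        if not PROMPT_TAG.match(tag):
            raise ValueError(f"unexpected prompt tag: {tag!r}")
    return "".join(f"[{tag}]" for tag in tags)


# Token-level tags sit inline, exactly where the laugh or pause should land.
line = "That is the funniest bug report I have read all week [laugh] [uv_break] let me take a look [lbreak]"
unknown = find_unknown_tokens(line)

# Sentence-level tags are combined into one prompt string.
prompt = build_prompt("oral_2", "break_6")
```

The check is worth having because a misspelled tag is plain text as far as the model is concerned, and a TTS model may simply try to read it aloud. For the actual inference call and where the prompt string goes, follow the project README rather than this sketch.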
Caveats
- Autoregressive instability: the FAQ admits multi-speaker drift and quality issues are “generally difficult to avoid” — you may need to sample multiple times
- English support is marked experimental; the model is primarily optimized for Chinese
- The “unrecommended” optional dependencies are flagged for a reason: TransformerEngine support is described as incomplete, and FlashAttention-2 is said to slow generation down
- No open-source emotion control yet; only laughter and breaks are available
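The "sample multiple times" advice from the first caveat boils down to a best-of-N loop: generate several candidates with different seeds and keep the one a scoring function likes best. The sketch below is generic and assumes nothing about ChatTTS itself. The synthesize and score callables are hypothetical stand-ins you would replace with a real inference call and a real quality check (a speaker-similarity score, an ASR round trip, or a human listening).

```python
import random
from typing import Callable, Sequence, Tuple


def best_of_n(
    synthesize: Callable[[str, int], Sequence[float]],
    score: Callable[[Sequence[float]], float],
    text: str,
    n: int = 4,
    base_seed: int = 0,
) -> Tuple[int, Sequence[float]]:
    """Generate n candidates with consecutive seeds; return (seed, audio) of the top scorer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best_seed, best_audio, best_score = base_seed, None, float("-inf")
    for seed in range(base_seed, base_seed + n):
        audio = synthesize(text, seed)
        candidate_score = score(audio)
        # Strict comparison: on a tie, the earliest seed wins.
        if candidate_score > best_score:
            best_seed, best_audio, best_score = seed, audio, candidate_score
    return best_seed, best_audio


def fake_synthesize(text: str, seed: int) -> Sequence[float]:
    """Stand-in for a TTS call: seeded random samples, one per input character."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(len(text))]


def fake_score(audio: Sequence[float]) -> float:
    """Stand-in quality metric: prefers clips with a lower peak amplitude."""
    return -max(abs(sample) for sample in audio)


seed, audio = best_of_n(fake_synthesize, fake_score, "hello there", n=5)
```

Returning the winning seed alongside the audio matters in practice: once you find a take without speaker drift, you want to be able to reproduce it. Keep in mind that every extra candidate costs a full generation pass, so N multiplies your latency.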
Verdict
Worth a look if you’re building Chinese-first voice agents or researching controllable prosody in TTS. Skip it if you need production-ready English voices or commercial licensing: the model weights are strictly non-commercial.