A 1.6B-parameter voice actor that laughs, coughs, and clears its throat
Dia generates multi-speaker dialogue with nonverbal sounds in a single pass, no post-processing required.

What it does
Dia is a 1.6B-parameter TTS model that turns transcripts into spoken dialogue between two speakers, marked with [S1] and [S2] tags. It also handles nonverbal cues—(laughs), (sighs), (clears throat), and about fifteen others—baked directly into the generated audio. You can condition it on a short audio clip for voice cloning, or let it invent new voices each run.
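To make the transcript format concrete, here is a minimal sketch of a helper that assembles a two-speaker script with [S1]/[S2] tags and parenthesized nonverbal cues. The Turn class, the build_transcript function, and the short KNOWN_CUES set are hypothetical names made up for illustration, not part of Dia; the cue names come from the title and text above, and the full supported list is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Cue names mentioned in this write-up; an illustrative subset, not Dia's full list.
KNOWN_CUES = {"laughs", "sighs", "clears throat", "coughs"}


@dataclass
class Turn:
    speaker: int                # 1 or 2, mapped to [S1] / [S2]
    text: str                   # what the speaker says
    cue: Optional[str] = None   # optional nonverbal cue placed after the text


def build_transcript(turns: List[Turn]) -> str:
    """Join turns into one tagged transcript string."""
    parts = []
    for turn in turns:
        if turn.speaker not in (1, 2):
            raise ValueError(f"only two speakers are supported, got {turn.speaker}")
        if turn.cue is not None and turn.cue not in KNOWN_CUES:
            raise ValueError(f"unknown cue: {turn.cue!r}")
        line = f"[S{turn.speaker}] {turn.text.strip()}"
        if turn.cue:
            line += f" ({turn.cue})"
        parts.append(line)
    return " ".join(parts)


script = build_transcript([
    Turn(1, "Did you hear the demo?"),
    Turn(2, "I did. It even coughs on cue.", cue="laughs"),
])
print(script)
```

The resulting string is what you would hand to the model as input text; loading the model and generating audio are left out of this sketch.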
The interesting bit
Most TTS systems generate clean speech and call it a day. Dia treats conversation as the native format, not a concatenation of monologues. The model was inspired by SoundStorm and Parakeet, and it runs at 2.1x real-time on an RTX 4090 in bfloat16, fast enough to iterate on scripts without brewing coffee between generations.
Key highlights
- Single-pass dialogue generation with speaker tags and nonverbal sounds
- Voice cloning via 5–10 second audio prompts (with transcript prepended)
- Hugging Face Transformers integration; also runs standalone with pip or uv
- ~4.4 GB VRAM in mixed precision, ~7.9 GB in float32
- Apache 2.0 license; weights hosted on Hugging Face
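The voice-cloning highlight notes that the audio prompt's transcript is prepended to the script. A minimal sketch of that text-side step follows; with_clone_prompt is a hypothetical helper name, and the check that both strings start with a speaker tag is an assumption added for safety, not a documented requirement. Loading the audio clip and calling the model are omitted.

```python
def with_clone_prompt(prompt_transcript: str, script: str) -> str:
    """Prepend the transcript of the audio prompt to the script to be generated."""
    prompt = prompt_transcript.strip()
    body = script.strip()
    # Assumption: both pieces use the [S1]/[S2] tagging described earlier.
    for name, text in (("prompt_transcript", prompt), ("script", body)):
        if not text.startswith(("[S1]", "[S2]")):
            raise ValueError(f"{name} should begin with a speaker tag")
    return f"{prompt} {body}"


full_text = with_clone_prompt(
    "[S1] This is what my voice sounds like.",
    "[S1] Welcome back to the show. [S2] Glad to be here.",
)
print(full_text)
```

The prompt transcript should match what is actually spoken in the 5–10 second clip, since the model is given the clip and this combined text together.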
Caveats
- English only; CPU support and quantization are on the TODO list
- Length-sensitive: text corresponding to under 5s of audio sounds unnatural; text corresponding to over 20s of audio comes out unnaturally fast
- RTX 5000-series GPUs need torch 2.8 nightly (see issue #26)
- No fixed default voice—speaker consistency requires seed locking or audio prompting
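The length caveat above can be turned into a rough pre-flight check. This is a sketch under a loud assumption: it estimates spoken duration from word count at 2.5 words per second, a generic speaking-rate guess rather than anything Dia defines. The function names are hypothetical; only the 5s and 20s bounds come from the caveat.

```python
import re

WORDS_PER_SECOND = 2.5                 # assumed average speaking rate, not a Dia constant
MIN_SECONDS, MAX_SECONDS = 5.0, 20.0   # bounds taken from the length caveat

# Matches speaker tags and anything in parentheses (cues, but also ordinary asides).
_MARKUP = re.compile(r"\[S[12]\]|\([^)]*\)")


def estimate_seconds(transcript: str) -> float:
    """Rough spoken duration: count words after removing tags and cues."""
    spoken = _MARKUP.sub(" ", transcript)
    return len(spoken.split()) / WORDS_PER_SECOND


def length_verdict(transcript: str) -> str:
    """Classify a transcript as 'too short', 'ok', or 'too long'."""
    seconds = estimate_seconds(transcript)
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds > MAX_SECONDS:
        return "too long"
    return "ok"


print(length_verdict("[S1] Hi there. (laughs)"))
```

A script flagged as too long could be split at speaker-turn boundaries and generated in pieces, though keeping voices consistent across pieces then depends on the seed locking or audio prompting mentioned in the last caveat.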
Verdict
Worth a spin if you’re prototyping podcasts, games, or any project where two people need to sound like they’re actually talking. Skip it for now if you need non-English, CPU-only deployment, or production-grade voice consistency without prompt engineering.