nari-labs/dia

A 1.6B-parameter voice actor that laughs, coughs, and clears its throat

Dia generates two-speaker dialogue, nonverbal sounds included, in a single pass with no post-processing.

19.3k stars | Python | Image · Video · Audio
Velocity (7d): +47 ★ / day | Trend: steady

What it does

Dia is a 1.6B-parameter TTS model that turns a transcript into spoken dialogue between two speakers, marked with [S1] and [S2] tags. It also renders nonverbal cues such as (laughs), (sighs), and (clears throat), plus about fifteen others, directly in the generated audio. You can condition it on a short audio clip for voice cloning, or let it invent new voices on each run.
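To make the transcript format concrete, here is a minimal sketch that assembles a two-speaker script with inline nonverbal cues. The helpers `build_script` and `find_cues` are hypothetical names written for this illustration; they are not part of Dia. Only the [S1]/[S2] tags and the parenthesised cues come from the description above.

```python
import re

def build_script(turns):
    """Join (speaker, text) turns into one '[S1] ... [S2] ...' transcript.

    Hypothetical helper for illustration; not part of the Dia package.
    """
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"only two speakers are supported, got {speaker!r}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)

def find_cues(script):
    """Return the parenthesised nonverbal cues in a script, in order."""
    return re.findall(r"\(([a-z ]+)\)", script)

script = build_script([
    (1, "Did you hear the demo? (laughs)"),
    (2, "(clears throat) I did. It even coughs on cue."),
])
print(script)
print(find_cues(script))
```

The resulting string is the kind of input the model expects; the repository README documents the actual loading and generation calls.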

The interesting bit

Most TTS systems generate clean speech and call it a day. Dia treats conversation as the native format, not a concatenation of monologues. The model was inspired by SoundStorm and Parakeet, and it runs at 2.1x real-time on an RTX 4090 in bfloat16 (roughly 2.1 seconds of audio per second of compute), fast enough to iterate on scripts without brewing coffee between generations.

Key highlights

  • Single-pass dialogue generation with speaker tags and nonverbal sounds
  • Voice cloning via 5–10 second audio prompts (with transcript prepended)
  • Hugging Face Transformers integration; also runs standalone with pip or uv
  • ~4.4 GB VRAM in mixed precision, ~7.9 GB in float32
  • Apache 2.0 license; weights hosted on Hugging Face
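As a sanity check on the VRAM figures above, the sketch below computes the memory taken by the weights alone. It assumes the weights are stored at the compute precision (2 bytes per parameter for bfloat16 or float16, 4 for float32); the remainder of each reported figure would then be activations, caches, and runtime overhead, which is an assumption rather than a measured breakdown.

```python
PARAMS = 1.6e9  # parameter count quoted in the summary

# Bytes per parameter for each storage precision.
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}

def weight_gb(dtype):
    """Memory for the weights alone, in decimal gigabytes."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: {weight_gb(dtype):.1f} GB of weights")
```

Under these assumptions the weights account for about 3.2 GB of the ~4.4 GB figure and about 6.4 GB of the ~7.9 GB figure.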

Caveats

  • English only; CPU support and quantization are on the TODO list
  • Short inputs (under about 5 s of resulting audio) sound unnatural; long inputs (over about 20 s) come out unnaturally fast
  • RTX 5000-series GPUs need torch 2.8 nightly (see issue #26)
  • No fixed default voice: speaker consistency requires fixing the random seed or supplying an audio prompt
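One way to work within the length caveat is to split a long script at speaker-turn boundaries so that each chunk stays in the comfortable range. The sketch below is illustrative only: `estimate_seconds` and `chunk_script` are hypothetical helpers, and the speaking rate of 2.5 words per second is an assumed average, not a figure from the Dia project.

```python
import re

WORDS_PER_SECOND = 2.5    # assumed average speaking rate, not from the Dia project
MIN_S, MAX_S = 5.0, 20.0  # comfortable range from the caveats above

def estimate_seconds(text):
    """Rough duration estimate: word count (ignoring [S1]/[S2] tags) over the assumed rate."""
    words = [w for w in text.split() if not re.fullmatch(r"\[S[12]\]", w)]
    return len(words) / WORDS_PER_SECOND

def chunk_script(script, max_s=MAX_S):
    """Group whole speaker turns into chunks whose estimate stays under max_s.

    A single turn longer than max_s is kept as its own chunk, not split.
    """
    turns = re.findall(r"\[S[12]\][^\[]*", script)
    chunks, current = [], ""
    for turn in turns:
        candidate = (current + " " + turn.strip()).strip()
        if current and estimate_seconds(candidate) > max_s:
            chunks.append(current)
            current = turn.strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Four turns of 20 words each, about 8 s per turn at the assumed rate.
demo = " ".join(f"[S{1 + i % 2}] " + " ".join(["word"] * 20) for i in range(4))
for chunk in chunk_script(demo):
    print(f"{estimate_seconds(chunk):.0f}s  {chunk[:30]}...")
```

Chunks generated separately will not share voices by default, so the seed or audio prompt would need to be held fixed across them, as the last caveat notes.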

Verdict

Worth a spin if you’re prototyping podcasts, games, or any project where two people need to sound like they’re actually talking. Skip it for now if you need non-English, CPU-only deployment, or production-grade voice consistency without prompt engineering.
