An 8B-parameter voice in your GPU
MisoTTS brings Sesame-style conversational speech synthesis to local hardware, with a Llama backbone and a stubbornly English-only vocabulary.

What it does
MisoTTS is an 8-billion-parameter text-to-speech model that generates conversational audio from text, optionally cloning a voice from a short audio prompt. It runs locally via a single Python script that downloads weights from Hugging Face on first run. The output is watermarked by default using Sony’s SilentCipher.
The interesting bit
The architecture borrows from Sesame’s CSM: a Llama-8B backbone handles interleaved text and audio tokens, while a separate 300M decoder predicts the 32 codebooks of each audio frame. This two-stage setup lets the big model focus on “what to say and how to emote” and the small model handle the acoustic details.
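To make the two-stage split concrete, here is a toy sketch of the control flow only. It is not MisoTTS code: `backbone_step`, `decoder_step`, and the codebook size are hypothetical stand-ins, and random numbers take the place of real model outputs. The point is the shape of the loop: one backbone pass per audio frame, followed by 32 small decoder steps that fill in that frame's codebooks.

```python
import random

NUM_CODEBOOKS = 32    # from the article: 32 codebooks per audio frame
CODEBOOK_SIZE = 2048  # assumption: illustrative entries per codebook, not from the article


def backbone_step(context, rng):
    """Hypothetical stand-in for the 8B backbone: history in, one frame-level state out."""
    return [rng.random() for _ in range(8)]  # toy 8-dimensional hidden state


def decoder_step(hidden, previous_codes, rng):
    """Hypothetical stand-in for the 300M decoder: choose the next code for this frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(context, rng):
    # Stage 1: the large model summarises the text and audio history once per frame.
    hidden = backbone_step(context, rng)
    # Stage 2: the small model fills in the frame's codebooks one at a time,
    # each step seeing the codes already chosen for this frame.
    codes = []
    for _ in range(NUM_CODEBOOKS):
        codes.append(decoder_step(hidden, codes, rng))
    return codes


def generate(text_tokens, num_frames, seed=0):
    rng = random.Random(seed)
    context = list(text_tokens)        # interleaved history starts with the text tokens
    frames = []
    for _ in range(num_frames):
        frame = generate_frame(context, rng)
        frames.append(frame)
        context.append(tuple(frame))   # each finished audio frame joins the history
    return frames


frames = generate(text_tokens=[101, 102, 103], num_frames=4)
print(len(frames), "frames,", len(frames[0]), "codes per frame")
```

In a real system the frames would then go to the audio codec to be turned into a waveform; that step is omitted here.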
Key highlights
- 8B Llama backbone + 300M audio decoder, 32 Mimi codebooks, 2,048 max sequence length
- Voice cloning from prompted audio with transcript alignment
- Default inference in bfloat16; CUDA strongly recommended (VRAM requirements depend on precision)
- Watermarking enabled by default via SilentCipher
- English only — no multilingual support yet
Caveats
- The README warns about watermarking timeouts on first download and suggests rerunning the command
- “Sufficient VRAM” is vague; no specific GPU requirements or benchmarks are listed
- English-only support is explicit, so don’t expect Mandarin or Spanish out of the box
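The VRAM caveat can be narrowed a little with back-of-the-envelope arithmetic. The sketch below gives a lower bound for the weights alone, taking the nominal parameter counts ("8B" and "300M") at face value. These are assumptions, not measured figures, and activations, the attention cache, the audio codec, and the watermarker all add to the total. float32 appears only for comparison; the article states bfloat16 as the default.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: nominal sizes quoted in the article, not exact parameter counts.
BACKBONE_PARAMS = 8_000_000_000
DECODER_PARAMS = 300_000_000
TOTAL_PARAMS = BACKBONE_PARAMS + DECODER_PARAMS


def weight_memory_gib(num_params, dtype):
    """Memory for the weights only, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("bfloat16", "float32"):
    gib = weight_memory_gib(TOTAL_PARAMS, dtype)
    print(f"{dtype}: ~{gib:.1f} GiB for weights alone")
# bfloat16: ~15.5 GiB for weights alone
# float32: ~30.9 GiB for weights alone
```

By this estimate, a 16 GB card would be marginal at bfloat16 once runtime overhead is included, so a 24 GB card is the safer guess until the README publishes actual requirements.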
Verdict
Worth a look if you’re building voice agents or need local, emotive TTS without API calls. Skip it if you need multilingual support, are running on CPU-only hardware, or want detailed performance numbers before committing GPU time.