MisoLabsAI/MisoTTS

An 8B-parameter voice in your GPU

MisoTTS brings Sesame-style conversational speech synthesis to local hardware, with a Llama backbone and a stubbornly English-only vocabulary.

2.2k stars · Python · Image · Video · Audio
Star velocity (7d): +119 ★/day · Trend: steady

What it does

MisoTTS is an 8-billion-parameter text-to-speech model that generates conversational audio from text, optionally cloning a voice from a short audio prompt. It runs locally via a single Python script that downloads weights from Hugging Face on first run. The output is watermarked by default using Sony’s SilentCipher.

The interesting bit

The architecture borrows from Sesame’s CSM: a Llama-8B backbone handles interleaved text and audio tokens, while a separate 300M decoder predicts the 32 codebooks of each audio frame. This two-stage setup lets the big model focus on “what to say and how to emote” and the small model handle the acoustic details.
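The two-stage split can be illustrated with a toy generation loop. This is a sketch only: `backbone_step`, `decoder_step`, and `generate` are hypothetical stand-ins built on Python's `random` module, not MisoTTS or CSM code, and the codebook size of 2,048 is an assumption made for illustration. It shows just the control flow described above: one backbone call per audio frame, followed by a small decoder that fills in that frame's 32 codebook entries.

```python
import random

NUM_CODEBOOKS = 32     # codebooks per audio frame, as stated in the summary
CODEBOOK_SIZE = 2048   # assumed number of entries per codebook (illustrative)


def backbone_step(text_tokens, audio_frames):
    """Stand-in for the large backbone.

    Collapses the text tokens and all previously generated audio frames
    into a single integer that plays the role of a hidden state.
    """
    return sum(text_tokens) + sum(sum(frame) for frame in audio_frames)


def decoder_step(hidden):
    """Stand-in for the small audio decoder.

    Picks the codes of one frame one codebook at a time; each pick depends
    on the backbone state, the codebook index, and the codes chosen so far.
    """
    frame = []
    for k in range(NUM_CODEBOOKS):
        rng = random.Random(hidden * 1_000_003 + k * 7_919 + sum(frame))
        frame.append(rng.randrange(CODEBOOK_SIZE))
    return frame


def generate(text_tokens, num_frames):
    """Alternate backbone and decoder calls to build a list of frames."""
    frames = []
    for _ in range(num_frames):
        hidden = backbone_step(text_tokens, frames)  # the "what to say" stage
        frames.append(decoder_step(hidden))          # the acoustic-detail stage
    return frames


frames = generate(text_tokens=[101, 7, 42], num_frames=4)
print(len(frames), "frames of", len(frames[0]), "codes each")
```

In a real system the frames would then be passed to the Mimi codec to produce a waveform; that step is omitted here.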

Key highlights

  • 8B Llama backbone + 300M audio decoder, 32 Mimi codebooks, 2,048 max sequence length
  • Voice cloning from prompted audio with transcript alignment
  • Default inference in bfloat16; CUDA strongly recommended (VRAM requirements depend on precision)
  • Watermarking enabled by default via SilentCipher
  • English only — no multilingual support yet
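As a rough guide to how VRAM scales with precision, the arithmetic below estimates the memory needed for the weights alone. It assumes the nominal parameter counts quoted above (8B plus 300M) and standard bytes-per-parameter figures; the `int8` row is hypothetical, since the summary only mentions bfloat16. Activations, the attention cache, the Mimi codec, and the watermarker all add to these numbers, so treat them as lower bounds rather than requirements.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

BACKBONE_PARAMS = 8_000_000_000   # nominal "8B" backbone
DECODER_PARAMS = 300_000_000      # nominal "300M" audio decoder


def weight_memory_gib(dtype):
    """Return the weight-only memory footprint in GiB for a given dtype."""
    total_params = BACKBONE_PARAMS + DECODER_PARAMS
    return total_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(dtype):5.1f} GiB")
```

Under these assumptions the default bfloat16 weights come to roughly 15.5 GiB, and float32 doubles that.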

Caveats

  • The README warns about watermarking timeouts on first download and suggests rerunning the command
  • “Sufficient VRAM” is vague; no specific GPU requirements or benchmarks are listed
  • English-only support is explicit, so don’t expect Mandarin or Spanish out of the box
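The "rerun the command" advice for first-download timeouts can be automated with a small wrapper. The sketch below is generic: `run_with_retries` is a hypothetical helper, the command to run is whatever inference command the README gives, and it assumes a timeout shows up as a non-zero exit code, which the summary does not confirm.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay_s=5.0):
    """Run cmd until it exits with code 0, up to `attempts` times.

    Returns the number of the attempt that succeeded and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # brief pause before trying again
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")


# Demonstration with a trivial command; substitute the real inference
# command from the README in place of this list.
print(run_with_retries([sys.executable, "-c", "pass"], delay_s=0))
```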

Verdict

Worth a look if you’re building voice agents or need local, emotive TTS without API calls. Skip it if you need multilingual support, are running on CPU-only hardware, or want detailed performance numbers before committing GPU time.
