fishaudio/fish-speech

A 4B-parameter TTS that whispers, shouts, and laughs on command

Fish Speech S2 Pro turns text into emotionally controllable speech across 80+ languages using inline tags like [whisper] or [angry].

30.7k stars · Python · Image · Video · Audio
Velocity (7d): +32 ★ / day · Trend: steady

What it does

Fish Speech S2 Pro is a 4-billion-parameter text-to-speech system trained on over 10 million hours of audio. It generates speech in 80+ languages and accepts natural-language control tags embedded directly in the text — no separate emotion models or post-processing required. The project provides inference code, server deployment via SGLang or vLLM, and a WebUI.
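
To make the inline-tag idea concrete, here is a minimal sketch that pulls bracketed tags out of an input string so you can inspect what a prompt contains before sending it anywhere. It is not part of the project's code: `split_tagged_text` is a hypothetical helper, and grouping each tag with the text that follows it is only a convenience for inspection, not a claim about how the model scopes a tag.

```python
import re

# Matches one bracketed control tag such as [whisper] or [professional broadcast tone].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_text(text):
    """Split text into (tag, segment) pairs; tag is None before the first tag.

    Hypothetical helper for inspecting prompts, not a Fish Speech API.
    """
    segments = []
    current_tag = None
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        # Text between the previous tag (or the start) and this tag.
        chunk = text[cursor:match.start()].strip()
        if chunk:
            segments.append((current_tag, chunk))
        current_tag = match.group(1).strip()
        cursor = match.end()
    # Whatever follows the last tag.
    tail = text[cursor:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


example = "Hello there. [whisper] Keep this quiet. [angry] I said no!"
for tag, segment in split_tagged_text(example):
    print(tag, "->", segment)
```

Because the tags live in the text itself, the string you would pass to the model is simply `example` unchanged; the helper only exists to check which tags a prompt carries.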

The interesting bit

The “Dual-AR” architecture treats speech generation like an LLM: a slow 4B-parameter autoregressive model predicts semantic structure, while a fast 400M-parameter model fills in the nine residual audio codebooks at each time step. Because the structure mirrors standard decoder-only transformers, it piggybacks on existing LLM serving infrastructure (continuous batching, paged KV cache, CUDA graphs) rather than reinventing the inference stack.
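
The control flow of that two-model split can be sketched with stubs. Everything here is illustrative: `slow_ar_step` and `fast_ar_step` are hypothetical stand-ins that draw random tokens instead of running a network, and `CODEBOOK_SIZE` is an arbitrary demo value, not a figure from the project. The only things taken from the description are the split into one semantic token plus nine residual codes per time step.

```python
import random

NUM_RESIDUAL_CODEBOOKS = 9   # from the description above
CODEBOOK_SIZE = 1024         # hypothetical value, chosen only for this demo


def slow_ar_step(semantic_history, rng):
    """Stand-in for the 4B model: return the next semantic token.

    A real model would condition on semantic_history; this stub ignores it
    and draws a random token.
    """
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token, rng):
    """Stand-in for the 400M model: return one code per residual codebook.

    A real model would condition on semantic_token; this stub draws randomly.
    """
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_RESIDUAL_CODEBOOKS)]


def generate_frames(num_frames, seed=0):
    """Outer loop over time steps: slow model first, then the fast model."""
    rng = random.Random(seed)
    semantic_history = []
    frames = []
    for _ in range(num_frames):
        semantic = slow_ar_step(semantic_history, rng)
        residuals = fast_ar_step(semantic, rng)
        semantic_history.append(semantic)
        frames.append([semantic] + residuals)
    return frames


frames = generate_frames(num_frames=4)
print(len(frames), "frames,", len(frames[0]), "codes per frame")
```

The outer loop is ordinary token-by-token decoding, which is why the serving optimizations built for LLMs can apply to the slow model without modification.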

Key highlights

  • Inline emotional control: 15,000+ supported tags including free-form descriptions like [professional broadcast tone]; insert anywhere in text for sub-word granularity
  • Benchmark claims: lowest reported WER on Seed-TTS Eval for Chinese (0.54%) and English (0.99%), beating listed closed-source competitors; 81.88% win rate on EmergentTTS-Eval
  • Performance: real-time factor (RTF) of 0.195 and ~100 ms time-to-first-audio on a single H200 via SGLang; throughput of 3,000+ acoustic tokens/second
  • Multi-speaker, multi-turn: a single reference clip containing several speakers can be disambiguated via <|speaker:i|> tokens; the context window carries across dialogue turns
  • Voice cloning: a 10-30 second reference sample is claimed to be sufficient, with no fine-tuning
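
To put the RTF figure in practical terms, the sketch below assumes the usual definition (generation time divided by the duration of the audio produced), which the summary does not spell out. The function names are illustrative only.

```python
def synthesis_seconds(audio_seconds, rtf):
    """Estimated wall-clock time to generate audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


def realtime_speedup(rtf):
    """Seconds of audio produced per second of compute."""
    return 1.0 / rtf


RTF = 0.195  # figure quoted above for a single H200 via SGLang

print(f"{synthesis_seconds(60.0, RTF):.1f} s to generate 60 s of audio")
print(f"{realtime_speedup(RTF):.2f}x faster than real time")
```

Under that assumption, a minute of speech takes about 11.7 seconds to generate on the benchmark hardware; slower GPUs would raise the RTF accordingly.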

Caveats

  • License is custom, not open source: released under the “Fish Audio Research License” — commercial use restrictions apply, and the authors state they “will take action against any violation”
  • Hardware expectations: the benchmark numbers cited are from an NVIDIA H200; no explicit minimum specs are given for local inference, but 4B + 400M parameters plus 10-codebook RVQ decoding imply substantial GPU memory requirements
  • Self-reported benchmarks: all comparison numbers come from the project’s own technical report; no independent third-party evaluation is referenced in the README
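
A back-of-the-envelope estimate helps size the hardware caveat. This sketch treats "4B" and "400M" as exact nominal counts and assumes standard byte widths per parameter. It covers weights only, so the KV cache, activations, and the audio codec would all add to the total.

```python
SLOW_AR_PARAMS = 4_000_000_000   # "4B" as quoted above, treated as nominal
FAST_AR_PARAMS = 400_000_000     # "400M" as quoted above, treated as nominal

# Standard storage widths; which precision the released weights use is an assumption.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gigabytes(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


total = SLOW_AR_PARAMS + FAST_AR_PARAMS
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gigabytes(total, dtype):.1f} GB for weights alone")
```

At 16-bit precision the weights alone come to roughly 8.8 GB, before any runtime overhead.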

Verdict

Worth exploring if you need expressive, multilingual TTS with fine-grained prosodic control and already run GPU inference infrastructure. Skip if you need permissively licensed weights for commercial products, or if you’re hoping to run quality TTS on consumer hardware without optimization work.
