A 4B-parameter TTS that whispers, shouts, and laughs on command
Fish Speech S2 Pro turns text into emotionally controllable speech across 80+ languages using inline tags like [whisper] or [angry].

What it does
Fish Speech S2 Pro is a 4-billion-parameter text-to-speech system trained on over 10 million hours of audio. It generates speech in 80+ languages and accepts natural-language control tags embedded directly in the text — no separate emotion models or post-processing required. The project provides inference code, server deployment via SGLang or vLLM, and a WebUI.
The interesting bit
The “Dual-AR” architecture treats speech generation like an LLM: a slow 4B-parameter autoregressive model predicts semantic structure, while a fast 400M-parameter model fills in nine residual audio codebooks in parallel. Because the structure mirrors standard decoder-only transformers, it piggybacks on existing LLM serving infrastructure — continuous batching, paged KV cache, CUDA graphs — rather than reinventing the inference stack.
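To make the division of labour concrete, here is a minimal toy sketch of a two-level decoding loop. It is not the project's code: `slow_ar_step`, `fast_ar_step`, `generate`, and the codebook size of 1024 are all hypothetical stand-ins, and the arithmetic inside them only mimics "predict one semantic token, then fill in nine residual codes for that frame". Whether the real fast model emits those nine codes in parallel or one after another is not modelled here.

```python
from typing import List, Tuple

NUM_RESIDUAL_CODEBOOKS = 9  # from the text: nine residual codebooks per frame
CODEBOOK_SIZE = 1024        # assumption: illustrative vocabulary size only


def slow_ar_step(text_ids: List[int], semantic_history: List[int]) -> int:
    """Toy stand-in for the large model: next semantic token from text + history."""
    mix = sum(text_ids) + 31 * len(semantic_history) + sum(semantic_history)
    return mix % CODEBOOK_SIZE


def fast_ar_step(semantic_token: int) -> List[int]:
    """Toy stand-in for the small model: residual codes for a single frame."""
    return [
        (semantic_token * (k + 2) + k) % CODEBOOK_SIZE
        for k in range(NUM_RESIDUAL_CODEBOOKS)
    ]


def generate(text_ids: List[int], num_frames: int) -> List[Tuple[int, List[int]]]:
    """Outer loop over time (slow model), inner fill-in per frame (fast model)."""
    frames: List[Tuple[int, List[int]]] = []
    semantic_history: List[int] = []
    for _ in range(num_frames):
        semantic = slow_ar_step(text_ids, semantic_history)
        residuals = fast_ar_step(semantic)
        semantic_history.append(semantic)
        frames.append((semantic, residuals))
    return frames


frames = generate(text_ids=[5, 17, 42], num_frames=4)
for semantic, residuals in frames:
    print(semantic, residuals)
```

The outer loop is the part that looks like ordinary LLM decoding, one token per step conditioned on everything before it, which is why serving tricks built for decoder-only transformers can apply to it.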
Key highlights
- Inline emotional control: 15,000+ supported tags, including free-form descriptions like [professional broadcast tone]; insert anywhere in text for sub-word granularity
- Benchmark claims: lowest reported WER on Seed-TTS Eval for Chinese (0.54%) and English (0.99%), beating listed closed-source competitors; 81.88% win rate on EmergentTTS-Eval
- Performance: RTF 0.195 (roughly 0.2 seconds of compute per second of generated audio), ~100ms time-to-first-audio on a single H200 via SGLang; 3,000+ acoustic tokens/second throughput
- Multi-speaker, multi-turn: single reference audio with multiple speakers can be disambiguated via <|speaker:i|> tokens; context window carries across dialogue turns
- Voice cloning: 10-30 second reference samples claimed sufficient without fine-tuning
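Because the control tags live inside the text itself, a script can be checked before it is sent for synthesis. The helper below is a hypothetical convenience for doing that. `split_tags` and `TAG_PATTERN` are names invented for this sketch, and the assumption that a tag is any bracketed run without nested brackets comes only from the examples in this article. It says nothing about how the model tokenizes tags internally.

```python
import re
from typing import List, Tuple

# Assumption: a tag is "[...]" with no nested brackets, as in [whisper].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(script: str) -> List[Tuple[str, str]]:
    """Return ("tag", name) and ("text", span) pieces in reading order."""
    pieces: List[Tuple[str, str]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(script):
        before = script[cursor:match.start()].strip()
        if before:
            pieces.append(("text", before))
        pieces.append(("tag", match.group(1)))
        cursor = match.end()
    tail = script[cursor:].strip()
    if tail:
        pieces.append(("text", tail))
    return pieces


script = "[whisper] Don't wake them. [angry] Who left the door open?"
pieces = split_tags(script)
for kind, value in pieces:
    print(f"{kind:>4}: {value}")
```

A listing like this makes it easy to spot a mistyped or unclosed tag in a long script, since an unclosed bracket simply stays inside a text piece.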
Caveats
- License is custom, not open source: released under the “Fish Audio Research License” — commercial use restrictions apply, and the authors state they “will take action against any violation”
- Hardware expectations: benchmark numbers cited are from an NVIDIA H200; no explicit minimum specs are given for local inference, but 4B + 400M parameters plus 10-codebook RVQ decoding (one semantic codebook plus the nine residual ones) imply significant GPU memory requirements
- Self-reported benchmarks: all comparison numbers come from the project’s own technical report; no independent third-party evaluation is referenced in the README
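For a rough sense of the hardware caveat, the parameter counts can be turned into a back-of-envelope figure for weight memory. This is a sketch under stated assumptions: the precision the weights ship in is not given here, so several common choices are shown, and the result covers weights only. KV cache, activations, and the audio codec are excluded. `weight_gib` and `BYTES_PER_PARAM` are names made up for this example.

```python
SLOW_PARAMS = 4_000_000_000  # from the text: 4B-parameter slow model
FAST_PARAMS = 400_000_000    # from the text: 400M-parameter fast model

# Assumption: common storage precisions; the shipped dtype is not stated.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gib(num_params: int, dtype: str) -> float:
    """Memory for the weights alone, in GiB (no KV cache or activations)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


total_params = SLOW_PARAMS + FAST_PARAMS
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(total_params, dtype):.1f} GiB")
```

At 16-bit precision the weights alone come to a little over 8 GiB, so the practical requirement depends mostly on how much extra room the serving stack needs for caching and batching.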
Verdict
Worth exploring if you need expressive, multilingual TTS with fine-grained prosodic control and already run GPU inference infrastructure. Skip if you need permissively licensed weights for commercial products, or if you’re hoping to run quality TTS on consumer hardware without optimization work.