A TTS model that speaks 9 languages and 18+ Chinese dialects
CosyVoice treats speech synthesis as a language-modeling problem, with enough deployment options to satisfy both researchers and production engineers.

What it does
CosyVoice is a multilingual text-to-speech system built on large language models. The latest Fun-CosyVoice 3.0 handles nine languages plus more than 18 Chinese dialects and accents, supports zero-shot voice cloning, and can follow instructions for emotion, speed, and volume. The repo ships with training scripts, inference code, and Dockerized deployment runtimes.
The interesting bit
The project doesn’t just dump a model checkpoint and leave. It offers vLLM acceleration, a TensorRT-LLM runtime with a reported 4× speedup, and bi-streaming (streaming text in, streaming audio out) with latency as low as 150 ms. That’s an unusually complete stack for an open-source audio project.
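A latency figure like the 150 ms above is usually read as time to first audio chunk; that reading is an assumption here, not something the project states in this summary. As a rough sketch of how you might measure it yourself, the helper below times any iterator of audio chunks. `first_chunk_latency` and `fake_stream` are hypothetical names for illustration, and `fake_stream` is a stand-in generator, not CosyVoice's API.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_chunk_latency(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total chunk count)."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            # Only the first chunk matters for perceived responsiveness.
            first = clock() - start
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    return first, count


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (not CosyVoice's API)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # pretend the model is generating audio
        yield b"\x00" * 320      # dummy audio payload


if __name__ == "__main__":
    latency, n = first_chunk_latency(fake_stream())
    print(f"first chunk after {latency * 1000:.1f} ms, {n} chunks total")
```

To time a real deployment, pass whatever iterator your streaming client returns in place of `fake_stream()`. Network and audio-playback buffering sit on top of the number this reports.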
Key highlights
- 0.5B-parameter model that is competitive with larger closed-source rivals on character error rate and speaker similarity
- Pronunciation inpainting via Chinese Pinyin or English CMU phonemes for fine-grained control
- Text normalization built in, no traditional frontend required
- Streaming inference with KV-cache and SDPA optimization
- Docker images for FastAPI, gRPC, and Triton/TensorRT-LLM deployment
Caveats
- vLLM support is picky: only versions 0.9.0 and 0.11.x+ are supported, and the README warns that you may need a separate conda environment to avoid corrupting your main one
- The ttsfrd text-normalization package is optional but recommended, and the wheel is platform-specific: CPython 3.10 on Linux x86_64 (cp310, linux_x86_64)
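Both caveats can be checked before you install anything. The sketch below is a minimal preflight script, assuming the version rule exactly as summarized above (0.9.0, or 0.11.0 and later) and the wheel tags as listed. The function names are hypothetical and not part of the CosyVoice repo.

```python
import platform
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric parts of a version string like '0.11.2'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break            # stop at suffixes such as 'rc1'
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def vllm_version_ok(version: str) -> bool:
    """True for 0.9.0, or for 0.11.0 and later (rule assumed from the caveat)."""
    v = parse_version(version)
    v = v + (0,) * (3 - len(v))  # pad '0.11' to (0, 11, 0)
    return v == (0, 9, 0) or v >= (0, 11, 0)


def ttsfrd_wheel_fits(py: Tuple[int, int], system: str, machine: str) -> bool:
    """True when the interpreter matches a cp310 / linux_x86_64 wheel."""
    return tuple(py) == (3, 10) and system == "Linux" and machine == "x86_64"


if __name__ == "__main__":
    here = (sys.version_info[:2], platform.system(), platform.machine())
    print("ttsfrd wheel fits this interpreter:", ttsfrd_wheel_fits(*here))
    for candidate in ("0.9.0", "0.10.2", "0.11.0"):
        print(candidate, "supported:", vllm_version_ok(candidate))
```

Running it inside the conda environment you plan to use shows quickly whether you need the separate environment the README mentions.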
Verdict
Worth a look if you need production-grade multilingual TTS with voice cloning, especially for Chinese dialects. Skip it if you want a lightweight, dependency-free drop-in; this is a full-stack research-to-deployment system with the complexity that implies.