Is CosyVoice open source?

Yes — FunAudioLLM/CosyVoice is open source, released under the Apache-2.0 license.

What language is CosyVoice written in?

FunAudioLLM/CosyVoice is primarily written in Python.

How popular is CosyVoice?

FunAudioLLM/CosyVoice has 22.4k stars on GitHub. It gained about 27 stars per day over the past week, and its growth trend is currently cooling off.

Where can I find CosyVoice?

FunAudioLLM/CosyVoice is on GitHub at https://github.com/FunAudioLLM/CosyVoice.


FunAudioLLM/CosyVoice

Open-source voice cloning for nine languages and eighteen dialects

CosyVoice provides open-source training, inference, and deployment tools for zero-shot multilingual speech synthesis using large language models.

★ 22.4k stars · Python · Image · Video · Audio


Velocity (7d): +27 ★ / day · Trend: ↘ cooling

What it does

CosyVoice is a text-to-speech system built on large language models. It generates speech from text in nine languages and more than eighteen Chinese regional dialects, and can clone a speaker’s voice from a short audio sample without prior training on that voice. The project ships with pretrained 0.5B-parameter models, training scripts, and runtime integrations for both research and production deployment.

The interesting bit

Instead of treating speech as a signal-processing problem, CosyVoice treats it as a sequence-modeling task, borrowing heavily from the LLM playbook—complete with vLLM serving, TensorRT-LLM acceleration, and repetition-aware sampling. The latest release adds pronunciation inpainting, letting you patch specific phonemes in Chinese Pinyin or English CMU format when the model misreads a word.
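
To make the sampling idea concrete, here is a minimal sketch of repetition-aware sampling as it is commonly described: draw a token with nucleus (top-p) sampling, then, if that token already dominates the recent window of generated tokens, redraw from the full distribution. This is an illustration under stated assumptions, not CosyVoice's own code. The function names, the dictionary-based distribution, and the default values for `top_p`, `top_k`, `win_size`, and `tau_r` are all hypothetical choices made for the example.

```python
import random
from typing import Dict, List, Optional


def nucleus_sample(probs: Dict[int, float], top_p: float, top_k: int,
                   rng: random.Random) -> int:
    """Sample from the highest-probability tokens whose mass reaches top_p.

    At most top_k tokens are kept. `probs` maps token id -> probability.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        if len(kept) >= top_k or mass >= top_p:
            break
        kept.append((token, p))
        mass += p
    tokens = [t for t, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


def repetition_aware_sample(probs: Dict[int, float], history: List[int],
                            top_p: float = 0.8, top_k: int = 25,
                            win_size: int = 10, tau_r: float = 0.1,
                            rng: Optional[random.Random] = None) -> int:
    """Nucleus-sample a token, falling back to full-distribution sampling
    when the candidate is over-represented in the last `win_size` tokens.

    All defaults are illustrative values, not confirmed project settings.
    """
    rng = rng or random.Random()
    candidate = nucleus_sample(probs, top_p, top_k, rng)
    window = history[-win_size:]
    if window.count(candidate) >= win_size * tau_r:
        # The candidate repeats too often: redraw from every token,
        # which gives low-probability tokens a chance to break the loop.
        tokens = list(probs)
        candidate = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
    return candidate
```

With a sharply peaked distribution and an empty history, the function behaves like plain nucleus sampling. Once the history fills with the same token, the fallback path lets rarer tokens through, which is how this kind of sampler avoids the stuck, repeating output that autoregressive speech models can produce.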

Key highlights

Covers nine languages plus cross-lingual zero-shot voice cloning, with granular control over emotions, speed, volume, and dialect.
Bi-streaming architecture (streaming text in, streaming audio out) with a claimed end-to-end latency as low as 150 ms while maintaining audio quality.
Optional text normalization via ttsfrd handles numbers and symbols without a traditional frontend; the project falls back to WeTextProcessing if ttsfrd is unavailable.
Supports deployment through vLLM, FastAPI, gRPC, and NVIDIA Triton with TensorRT-LLM, which the project notes yields roughly 4× acceleration over plain Transformers inference.
Includes an evaluation suite and benchmark results comparing open and closed-source rivals on character error rate and speaker similarity.
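
For readers unfamiliar with the two metrics named above, the sketch below shows how they are commonly defined: character error rate as the edit distance between reference and recognized text divided by the reference length, and speaker similarity as the cosine similarity between two speaker-embedding vectors. It is a minimal illustration of the standard definitions, not the project's evaluation suite. The function names are hypothetical, and the sketch assumes no text normalization and that embeddings come from some separate speaker-verification model.

```python
import math
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (lower is better)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (higher is more similar)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In a typical TTS evaluation, the synthesized audio is transcribed by a speech recognizer and the transcript is scored against the input text with the error rate, while similarity is computed between embeddings of the prompt audio and the generated audio.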

Caveats

vLLM support is picky: only versions 0.9.0 and 0.11.x+ are confirmed to work, and the README warns that mismatched versions can corrupt your environment.
The optional ttsfrd normalization package ships as a Python 3.10 Linux x86_64 wheel, so non-Linux or newer-Python users are stuck with the fallback.
Much of the codebase is explicitly noted as borrowed from FunASR, FunCodec, Matcha-TTS, and WeNet—useful, but not a from-scratch stack.
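
The first two caveats can be checked before installation. The sketch below is a hypothetical preflight helper, not something shipped with CosyVoice: it tests whether a vLLM version string falls in the ranges described above, and whether the current interpreter matches the platform the ttsfrd wheel targets. It assumes that "0.11.x+" means any release at or above 0.11.0 and that only the exact 0.9.0 release counts on the older line; both readings are interpretations of the caveat, not confirmed rules.

```python
import platform
import sys
from typing import Optional, Sequence, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.11.2' into (0, 11, 2), ignoring suffixes such as 'rc1' or '+cu121'."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def vllm_version_confirmed(version: str) -> bool:
    """True for 0.9.0 or anything >= 0.11.0 (assumed reading of '0.11.x+')."""
    padded = (parse_version(version) + (0, 0, 0))[:3]
    return padded == (0, 9, 0) or padded >= (0, 11, 0)


def ttsfrd_wheel_compatible(py_version: Optional[Sequence[int]] = None,
                            system: Optional[str] = None,
                            machine: Optional[str] = None) -> bool:
    """True when running Python 3.10 on Linux x86_64, the wheel's stated target.

    Arguments default to the current interpreter and machine.
    """
    py = tuple(py_version if py_version is not None else sys.version_info)[:2]
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()
    return py == (3, 10) and system == "Linux" and machine == "x86_64"
```

Calling `ttsfrd_wheel_compatible()` with no arguments inspects the running environment. A `False` result means the WeTextProcessing fallback is the path to plan for.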

Verdict

Worth a look if you need an open, controllable TTS pipeline with strong Chinese dialect support and production serving options. Skip it if you want a lightweight, dependency-free library or a fully novel architecture without upstream lineage.
