netease-youdao/Confucius4-TTS

Your Voice in Mandarin, No Transcript Required

Most zero-shot TTS tools still demand reference transcripts or stumble across languages; this one claims to do neither.

519 stars Python Image · Video · Audio
Confucius4-TTS

What it does

Confucius4-TTS is a text-to-speech system built by NetEase Youdao that clones a speaker’s voice and renders speech in any of 14 supported languages. It pairs a speech encoder with a large language model to generate semantic tokens, then feeds those tokens to a flow-matching acoustic model to produce the final audio. The project aims to preserve both speaker identity and emotional tone even when the target language differs from the source.

The interesting bit

The heavy lifting is split between two trained modules: an LLM-based Text2Semantic stage that reasons over language and speaker conditioning, and a Semantic2Acoustic flow-matching decoder that renders mel spectrograms. Rather than reinventing every layer, it integrates existing models—Facebook’s Wav2Vec2-BERT and Amphion’s MaskGCT—into a single pipeline. The authors’ benchmark tables suggest this assembly often beats better-known rivals on cross-lingual word-error rate, though not always on speaker similarity.
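
The data flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's API: the class name, the three callables, and the toy stand-ins are all hypothetical, and passing the speaker conditioning to the second stage is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration only.
Waveform = List[float]
Conditioning = List[float]
SemanticTokens = List[int]
Mel = List[List[float]]


@dataclass
class TwoStagePipeline:
    """Hypothetical wrapper; none of these names come from the project."""
    encode_prompt: Callable[[Waveform], Conditioning]
    text2semantic: Callable[[str, str, Conditioning], SemanticTokens]
    semantic2acoustic: Callable[[SemanticTokens, Conditioning], Mel]

    def synthesize(self, text: str, language: str, prompt: Waveform) -> Mel:
        # Only raw prompt audio is used; no transcript of the prompt is passed.
        speaker = self.encode_prompt(prompt)
        # Stage 1: the LLM stage maps text plus conditioning to semantic tokens.
        tokens = self.text2semantic(text, language, speaker)
        # Stage 2: the decoder renders acoustic frames from those tokens.
        # (Reusing `speaker` here is an assumption of this sketch.)
        return self.semantic2acoustic(tokens, speaker)


# Toy stand-ins so the sketch runs end to end; real stages are neural models.
def toy_encoder(audio: Waveform) -> Conditioning:
    return [sum(audio) / max(len(audio), 1)]


def toy_text2semantic(text: str, language: str, speaker: Conditioning) -> SemanticTokens:
    return [ord(ch) % 256 for ch in text]


def toy_semantic2acoustic(tokens: SemanticTokens, speaker: Conditioning) -> Mel:
    return [[t / 255.0 + speaker[0]] for t in tokens]


pipeline = TwoStagePipeline(toy_encoder, toy_text2semantic, toy_semantic2acoustic)
mel = pipeline.synthesize("你好", "zh", [0.0, 0.1, -0.1])
```

Keeping the stages behind separate callables mirrors the split the project describes: either stage could in principle be swapped or evaluated on its own.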

Key highlights

  • Zero-shot voice cloning across 14 languages including Chinese, English, Japanese, Korean, and German, with no reference transcript required.
  • Competitive word-error rates on CV3-eval and X-Voice cross-lingual benchmarks compared to CosyVoice, F5-TTS, and ElevenLabs.
  • Two-stage architecture: LLM-driven Text2Semantic generation followed by flow-matching Semantic2Acoustic decoding.
  • Supports emotion transfer and unconstrained cloning from raw audio prompts.
  • Apache 2.0 licensed, with pretrained weights hosted on Hugging Face and ModelScope.
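
For readers unfamiliar with the word-error rate cited above, here is a small sketch of the standard definition: word-level edit distance divided by the number of reference words. It assumes whitespace tokenization, which is a simplification; the function name is illustrative, and the benchmarks may use a different normalization (character-level scoring is common for Chinese).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In a TTS evaluation the hypothesis would typically come from running a speech recognizer on the synthesized audio, so lower values indicate more intelligible output.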

Caveats

  • Speaker similarity (SIM) scores in the published tables sometimes trail competitors like OmniVoice or VoxCPM2, suggesting a trade-off between intelligibility and voice fidelity.
  • Requires external model downloads—Wav2Vec2-BERT, Amphion MaskGCT, and CAMPPlus—so it is not a self-contained single checkpoint.
  • Benchmark data is self-reported by the authors.
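
Speaker similarity in zero-shot TTS evaluations is commonly reported as the cosine similarity between speaker embeddings of the reference and the synthesized audio. Assuming the SIM column follows that convention (the source does not say which embedding model is used), the metric itself reduces to the following sketch, with an illustrative function name:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones come from a speaker-verification model.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Values closer to 1 mean the cloned voice sits nearer the reference speaker in embedding space, which is why a system can score well on word-error rate and still trail on SIM.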

Verdict

Try it if you need a research-friendly, Apache-licensed TTS system that handles cross-lingual zero-shot cloning without transcript hassles. Look elsewhere if your top priority is maximum speaker similarity or a one-click, dependency-free deployment.

Frequently asked

What is netease-youdao/Confucius4-TTS?
Confucius4-TTS is a zero-shot text-to-speech system from NetEase Youdao that clones a speaker's voice and renders speech in any of 14 supported languages, with no reference transcript required.
Is Confucius4-TTS open source?
Yes. netease-youdao/Confucius4-TTS is released under the Apache 2.0 license, with pretrained weights hosted on Hugging Face and ModelScope.
What language is Confucius4-TTS written in?
netease-youdao/Confucius4-TTS is primarily written in Python.
How popular is Confucius4-TTS?
netease-youdao/Confucius4-TTS has 519 stars on GitHub.
Where can I find Confucius4-TTS?
netease-youdao/Confucius4-TTS is on GitHub at https://github.com/netease-youdao/Confucius4-TTS.
