A master's thesis that, 60K stars later, admits it's outdated
The once-landmark real-time voice cloner now explicitly tells you to look elsewhere for quality.

What it does
Feed it a few seconds of someone’s voice plus text of your choice, and it generates new speech in that voice. The pipeline has three stages: a speaker encoder turns the reference audio into a voice embedding, a Tacotron synthesizer converts the text into mel spectrograms conditioned on that embedding, and a WaveRNN vocoder renders the spectrograms as audio in real time.
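To make the data flow between the three stages concrete, here is a minimal, hedged sketch. It is not the repository's code and contains no neural networks: the function names (encode_speaker, synthesize_mel, vocode, clone_voice) and the constants are hypothetical stand-ins chosen for illustration. It shows only the shape of what each stage hands to the next.

```python
import math
from typing import List

# Illustrative sizes only; the real models use different dimensions.
EMBED_DIM = 8   # length of the speaker embedding
N_MELS = 4      # mel bands per spectrogram frame
HOP = 16        # audio samples produced per mel frame


def encode_speaker(reference_wav: List[float]) -> List[float]:
    """Stand-in speaker encoder: any-length audio -> fixed-size unit vector."""
    chunk = max(1, len(reference_wav) // EMBED_DIM)
    raw = [
        sum(abs(s) for s in reference_wav[i * chunk:(i + 1) * chunk])
        for i in range(EMBED_DIM)
    ]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
    return [v / norm for v in raw]


def synthesize_mel(text: str, embedding: List[float]) -> List[List[float]]:
    """Stand-in synthesizer: one mel frame per character, shifted by the embedding."""
    bias = sum(embedding) / len(embedding)
    return [[(ord(ch) % 32) / 32 + bias for _ in range(N_MELS)] for ch in text]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stand-in vocoder: expands each mel frame into HOP waveform samples."""
    wav: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wav.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wav


def clone_voice(reference_wav: List[float], text: str) -> List[float]:
    """Chain the stages: audio -> embedding -> mel spectrogram -> waveform."""
    embedding = encode_speaker(reference_wav)
    mel = synthesize_mel(text, embedding)
    return vocode(mel)


if __name__ == "__main__":
    reference = [math.sin(i / 10) for i in range(1600)]  # fake reference clip
    audio = clone_voice(reference, "hello")
    print(len(audio), "samples generated")
```

The point of the sketch is the interface: the embedding is computed once per speaker and reused for any text, which is why a few seconds of reference audio are enough.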
The interesting bit
The author is admirably blunt: this repo has “quickly gotten old” and many paid SaaS offerings now sound better. It’s rare to see a popular open-source project steer users toward competitors and newer research. The thesis-born code is now a historical artifact that has received a maintenance update (packaging now uses uv) rather than a living state-of-the-art project.
Key highlights
- Implements SV2TTS, GE2E encoder, Tacotron, and WaveRNN from four separate papers
- GUI toolbox (demo_toolbox.py) and headless CLI (demo_cli.py) both included
- Pretrained models auto-download from Hugging Face; no manual hunting required
- Supports Windows and Linux, with CPU fallback if no NVIDIA GPU
- Author explicitly recommends Chatterbox for 2025-quality voice cloning
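The CPU fallback mentioned above is typically done with a device check like the one below. This is a common PyTorch pattern, not a copy of the repository's logic; pick_device is a hypothetical helper name, and the import guard is added here only so the sketch runs on machines without PyTorch installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return "cpu"  # no PyTorch available: treat as CPU-only
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Running on:", pick_device())
```

Expect CPU inference to be noticeably slower than GPU inference, especially in the vocoder stage.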
Caveats
- Audio quality lags behind current SaaS and open-source alternatives
- macOS not mentioned in supported platforms
- Training your own models requires dataset wrangling (LibriSpeech recommended)
Verdict
Worth a spin if you need a local, offline voice cloning baseline or want to study the SV2TTS architecture hands-on. Skip it if you need production-grade output; the README itself will tell you where to go.