CorentinJ/Real-Time-Voice-Cloning

A master's thesis that, 60K stars later, admits it's outdated

The once-landmark real-time voice cloner now explicitly tells you to look elsewhere for quality.

59.9k stars · Python · Image · Video · Audio
Velocity (7d): +23 ★/day · Trend: steady

What it does

Feed it a few seconds of someone’s voice and text of your choice; it generates new speech in that voice. The pipeline has three stages: a speaker encoder creates a voice embedding, a Tacotron synthesizer turns text into mel spectrograms conditioned on that embedding, and a WaveRNN vocoder renders audio in real time.
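To make the data flow concrete, here is a toy sketch of the three stages. It is not the repository's code: the function names, the constants, and the arithmetic inside each stage are hypothetical stand-ins chosen for illustration. It shows only how a fixed-size embedding, a sequence of mel frames, and a waveform pass from one stage to the next.

```python
from typing import List

# Illustrative sizes only (assumptions, not values read from the repo).
EMBED_DIM = 256    # length of the speaker embedding
N_MELS = 80        # mel channels per spectrogram frame
HOP_LENGTH = 200   # audio samples produced per mel frame


def encode_speaker(reference_wav: List[float]) -> List[float]:
    """Stage 1 stand-in: fold a clip of any length into a fixed-size unit vector."""
    embed = [0.0] * EMBED_DIM
    for i, sample in enumerate(reference_wav):
        embed[i % EMBED_DIM] += sample
    norm = sum(x * x for x in embed) ** 0.5 or 1.0  # avoid dividing by zero
    return [x / norm for x in embed]


def synthesize_mel(text: str, embed: List[float]) -> List[List[float]]:
    """Stage 2 stand-in: one mel frame per character, shifted by the embedding."""
    bias = sum(embed) / len(embed)
    return [[(ord(ch) % 32) / 32.0 + bias] * N_MELS for ch in text]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stage 3 stand-in: expand each mel frame into HOP_LENGTH audio samples."""
    wav: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wav.extend([level] * HOP_LENGTH)
    return wav


def clone_voice(reference_wav: List[float], text: str) -> List[float]:
    """Chain the stages: reference audio + text -> embedding -> mel -> waveform."""
    embed = encode_speaker(reference_wav)
    mel = synthesize_mel(text, embed)
    return vocode(mel)


if __name__ == "__main__":
    audio = clone_voice([0.1] * 1000, "hello")
    print(len(audio), "samples generated")
```

The point of the split is that only the first stage ever sees the reference voice. The synthesizer and vocoder work from the embedding and the text, which is why a few seconds of audio are enough to condition the output.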

The interesting bit

The author is admirably blunt: this repo has “quickly gotten old” and many paid SaaS offerings now sound better. It’s rare to see a popular open-source project steer users toward competitors and newer research. The thesis-born code has become a historical artifact with a maintenance update (now using uv for packaging) rather than a living SOTA project.

Key highlights

  • Implements SV2TTS, GE2E encoder, Tacotron, and WaveRNN from four separate papers
  • GUI toolbox (demo_toolbox.py) and headless CLI (demo_cli.py) both included
  • Pretrained models auto-download from Hugging Face; no manual hunting required
  • Supports Windows and Linux, with CPU fallback if no NVIDIA GPU
  • Author explicitly recommends Chatterbox for 2025-quality voice cloning

Caveats

  • Audio quality lags behind current SaaS and open-source alternatives
  • macOS not mentioned in supported platforms
  • Training your own models requires dataset wrangling (LibriSpeech recommended)

Verdict

Worth a spin if you need a local, offline voice cloning baseline or want to study the SV2TTS architecture hands-on. Skip it if you need production-grade output; the README itself will tell you where to go.
