Clone a voice with 10 minutes of audio and a Gradio tab
RVC wraps VITS-based voice conversion in a web UI so you can train a model on a lunch break's worth of recordings.

What it does
RVC is a voice-conversion toolkit built on VITS. You feed it roughly 10 minutes of clean speech, and it learns to map someone else’s voice (or your own pitched-up version) into that timbre. The project ships a Gradio web UI for training and inference, plus a separate real-time GUI that claims 170 ms end-to-end latency, or 90 ms if you have ASIO hardware and the patience to configure it.
The interesting bit
Instead of letting the model hallucinate timbre from scratch, RVC retrieves the closest matching feature from the training set and swaps it in. The README calls this “top1 retrieval to eliminate timbre leakage”; practically, it means the output voice stays closer to your target dataset rather than drifting back toward the source speaker. It also bundles RMVPE, a pitch-extraction model from Interspeech 2023, which the authors say beats crepe_full on quality while being faster and lighter.
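To make the retrieval idea concrete, here is a minimal NumPy sketch of top-1 nearest-neighbour feature replacement. It is not RVC's code: the function name `top1_retrieve`, the `blend` parameter, the brute-force distance search, and the toy 2-D features are all assumptions for illustration, and a real system would use much higher-dimensional features with an approximate index.

```python
import numpy as np

def top1_retrieve(source_feats, train_feats, blend=1.0):
    """Swap each source frame for its nearest training-set frame.

    source_feats: (T, D) content features from the input voice.
    train_feats:  (N, D) features stored from the target speaker's data.
    blend:        hypothetical mixing ratio; 1.0 = fully retrieved,
                  0.0 = source features left untouched.
    """
    # Squared Euclidean distance from every source frame to every stored frame.
    diff = source_feats[:, None, :] - train_feats[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    nearest = dist2.argmin(axis=1)       # top-1 match per source frame
    retrieved = train_feats[nearest]
    mixed = blend * retrieved + (1.0 - blend) * source_feats
    return mixed, nearest

# Toy data: three stored target frames, two incoming source frames.
train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
src = np.array([[0.9, 1.2], [4.0, 4.5]])

full, idx = top1_retrieve(src, train)            # full replacement
half, _ = top1_retrieve(src, train, blend=0.5)   # halfway blend
```

With `blend=1.0` every frame handed to the decoder comes from the target speaker's own data, which is one way to read the "leakage" claim. A lower value keeps more of the source features in the mix.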
Key highlights
- Low data floor: the docs suggest 10 minutes of low-noise audio is enough to start.
- Cross-vendor GPU support: separate requirements files for NVIDIA (CUDA), AMD (DirectML / ROCm on Linux), and Intel (IPEX on Linux).
- Pre-trained base model: built on roughly 50 hours of the open-source VCTK dataset, which the project says leaves “no copyright concerns.”
- Built-in vocal separation via UVR5 weights, so you can strip accompaniment before training.
- Model fusion via checkpoint merging in the UI.
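The fusion highlight is easiest to picture as a weighted average of two models' parameters. The write-up only says "checkpoint merging", so the linear blend below, the `merge_checkpoints` name, and the `alpha` parameter are assumptions for illustration, not RVC's actual code. Real checkpoints would hold framework tensors, not small NumPy arrays.

```python
import numpy as np

def merge_checkpoints(weights_a, weights_b, alpha=0.5):
    """Blend two models parameter by parameter: alpha * A + (1 - alpha) * B."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, a in weights_a.items():
        b = weights_b[name]
        if a.shape != b.shape:
            raise ValueError(f"shape mismatch for {name}: {a.shape} vs {b.shape}")
        merged[name] = alpha * a + (1.0 - alpha) * b
    return merged

# Toy "checkpoints" with a single parameter each.
voice_a = {"w": np.array([1.0, 3.0])}
voice_b = {"w": np.array([3.0, 5.0])}
fused = merge_checkpoints(voice_a, voice_b, alpha=0.25)  # 25% A, 75% B
```

Averaging like this only makes sense when both models share an architecture and, usually, a common base model, which is why the sketch rejects mismatched names and shapes.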
Caveats
- Real-time latency is hardware-dependent; the 90 ms figure requires ASIO drivers that “very much depend on hardware support.”
- You still need to manually download several pre-trained assets (Hubert base, UVR5 weights, optional RMVPE files) from Hugging Face before the UI will run.
- The README is primarily in Chinese; English docs exist but may lag.
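Because the UI will not run until the pre-trained assets are in place, a small pre-flight check can save a confusing traceback. The relative paths below are placeholders assumed for illustration, and `missing_assets` is a hypothetical helper; take the real file names and folders from the project's documentation.

```python
from pathlib import Path

# Hypothetical layout: replace with the paths the project's docs specify.
EXPECTED_ASSETS = (
    "assets/hubert_base.pt",   # Hubert base model (assumed path)
    "assets/uvr5_weights",     # UVR5 separation weights (assumed path)
    "assets/rmvpe.pt",         # optional RMVPE pitch model (assumed path)
)

def missing_assets(root, expected=EXPECTED_ASSETS):
    """Return the expected files or folders that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]

if __name__ == "__main__":
    for rel in missing_assets("."):
        print(f"missing: {rel}")
```

Run it from the repository root before launching the UI. An empty output means every listed path was found.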
Verdict
Worth a look if you need quick voice cloning for content creation, game modding, or accessibility projects and don’t want to wrangle So-VITS-SVC manually. Skip it if you need production-grade TTS or zero-latency streaming without ASIO hardware.