Clone a voice with 5 seconds of audio, or 1 minute to make it stick
A Python TTS toolkit that turns tiny voice samples into usable speech synthesis without enterprise-grade data collection.

What it does
GPT-SoVITS is a voice cloning and text-to-speech system that runs through a WebUI. Feed it a 5-second clip and it speaks in that voice immediately; fine-tune on about 1 minute of audio and the resemblance gets noticeably tighter. It also handles cross-lingual inference: train on one language and synthesize another, with Chinese, Japanese, English, Korean, and Cantonese supported.
The interesting bit
The project bundles the entire data pipeline: voice-accompaniment separation, automatic audio segmentation, Chinese ASR, and text labeling. Most TTS projects assume you already have a clean, labeled dataset. This one assumes you have a messy MP3 and a weekend.
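To make the "text labeling" step concrete, here is a minimal sketch of what the end of that pipeline produces: one annotation line per audio clip. It assumes a pipe-separated `path|speaker|language|text` layout and the language codes `zh`, `ja`, `en`, `ko`, `yue`, which is how I understand the project README to describe its training list file; check this against the version you install. The function names and sample paths are hypothetical.

```python
from pathlib import Path

# Assumed language codes; confirm against the README for your version.
LANGUAGE_CODES = {"zh", "ja", "en", "ko", "yue"}


def format_label_line(wav_path, speaker, language, text):
    """Build one 'path|speaker|language|text' annotation line (assumed layout)."""
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {language!r}")
    # The pipe is the field separator, so it must not appear inside a field.
    if "|" in speaker or "|" in text:
        raise ValueError("fields must not contain '|'")
    return f"{wav_path}|{speaker}|{language}|{text.strip()}"


def write_label_file(rows, out_path):
    """Write one annotation line per (wav_path, speaker, language, text) row."""
    lines = [format_label_line(*row) for row in rows]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical sample row, as an ASR pass might produce after segmentation.
print(format_label_line("clips/0001.wav", "demo", "en", " Hello there. "))
```

In practice the WebUI tools generate this file for you; the sketch only shows what to look for when checking or hand-correcting ASR output.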
Key highlights
- Zero-shot inference from 5 seconds of audio; few-shot fine-tuning from ~1 minute
- Cross-lingual synthesis across 5 languages without retraining the base model
- Integrated WebUI tools for dataset prep, including UVR5-based vocal separation
- Real-time factor (RTF, synthesis time divided by output audio duration) of 0.014 on an RTX 4090: 1400 words synthesized in ~3.4 seconds; CPU fallback available
- Pre-built Windows package, Colab notebook, Docker images, and HuggingFace demo
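The RTF figure in the highlights is easier to read with the arithmetic spelled out. RTF is synthesis time divided by the duration of the audio produced, so the two quoted numbers (0.014 and ~3.4 seconds) imply how much speech was generated. The sketch below only rearranges that definition; the function names are illustrative, and the audio duration is derived here, not stated in the source.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def implied_audio_seconds(synthesis_seconds, rtf):
    """Rearranged definition: audio duration = synthesis time / RTF."""
    return synthesis_seconds / rtf


RTF = 0.014          # quoted for an RTX 4090
SYNTH_SECONDS = 3.4  # quoted time for 1400 words (rounded)

audio_seconds = implied_audio_seconds(SYNTH_SECONDS, RTF)
speedup = 1 / RTF  # how many times faster than real time

print(f"implied audio: ~{audio_seconds:.0f} s (~{audio_seconds / 60:.1f} min)")
print(f"speedup over real time: ~{speedup:.0f}x")
```

Under these figures, the 1400 words come out to roughly four minutes of speech, generated about 70 times faster than it would take to play back.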
Caveats
- macOS GPU training is explicitly noted as “significantly lower quality”; CPU training is recommended instead
- Docker images lag behind the rapid commit pace, so local builds or code pulls may be needed
- Model setup involves several manual downloads that must land at exact paths, unless the install script handles them for you
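Because misplaced model files are the most likely setup failure, a small preflight check can save a confusing first run. The sketch below reports which expected directories are missing or empty. The two directory names are assumptions based on my reading of the project README and may differ in your checkout, and `missing_model_dirs` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def missing_model_dirs(repo_root, expected_dirs):
    """Return the expected directories that are absent or contain no entries."""
    root = Path(repo_root)
    missing = []
    for rel in expected_dirs:
        target = root / rel
        # A directory that exists but is empty still counts as missing.
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(rel)
    return missing


# Assumed locations; confirm against the README for your version.
EXPECTED_DIRS = [
    "GPT_SoVITS/pretrained_models",
    "tools/uvr5/uvr5_weights",
]

for rel in missing_model_dirs(".", EXPECTED_DIRS):
    print(f"missing or empty: {rel}")
```

Run it from the repository root before launching the WebUI; no output means every listed directory exists and has at least one file in it.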
Verdict
Worth a look if you need custom TTS without a voice-actor budget or a 100-hour dataset. Skip it if you need battle-tested, fully managed APIs or are allergic to conda environments and manual model wrangling.