Voice cloning without the adversarial drama
A 2019 voice conversion system that learned to swap speakers using only an autoencoder loss: no GANs, no parallel data, no fuss.

What it does
AutoVC converts speech from one speaker to another without needing recordings of the same sentence from both people. Feed it a mel-spectrogram and a target speaker embedding, and it reshapes the voice while preserving the words and prosody. The repo includes pre-trained models, Jupyter notebooks for conversion and vocoding, and a tiny verification dataset.
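
To make that data flow concrete, here is a minimal shape-level sketch in plain Python. It follows the paper's general description (a content encoder that sees the source speaker's embedding, and a decoder conditioned on the target speaker's embedding). The function names, the 80-bin and 256-dimension sizes, the stride, and the toy stand-in "networks" are all assumptions for illustration, not the repo's API.

```python
import random

N_MELS = 80     # mel bins per frame (assumed size)
EMB_DIM = 256   # speaker embedding length (assumed size)
STRIDE = 4      # toy temporal downsampling factor (hypothetical)


def toy_content_encoder(mel, src_emb):
    """Stand-in for the content encoder: keep every STRIDE-th frame.

    A real encoder is a neural network; this only mimics the bottleneck
    idea of squeezing the input so speaker detail has less room to pass.
    """
    assert len(src_emb) == EMB_DIM
    return mel[::STRIDE]


def toy_decoder(content, tgt_emb, n_frames):
    """Stand-in for the decoder: stretch the code back to n_frames and
    shift every bin by the mean of the target embedding."""
    assert len(tgt_emb) == EMB_DIM
    shift = sum(tgt_emb) / EMB_DIM
    frames = [content[min(t // STRIDE, len(content) - 1)] for t in range(n_frames)]
    return [[v + shift for v in frame] for frame in frames]


def convert(mel, src_emb, tgt_emb):
    """Source mel + source embedding -> content code -> target-voiced mel."""
    content = toy_content_encoder(mel, src_emb)
    return toy_decoder(content, tgt_emb, n_frames=len(mel))


rng = random.Random(0)
mel = [[rng.random() for _ in range(N_MELS)] for _ in range(10)]  # 10 frames
src = [rng.random() for _ in range(EMB_DIM)]
tgt = [rng.random() for _ in range(EMB_DIM)]

out = convert(mel, src, tgt)
print(len(out), len(out[0]))  # same shape as the input: frames x mel bins
```

The point of the sketch is the contract, not the math: a spectrogram goes in, a spectrogram of the same shape comes out, and the only thing that changes between "reconstruct" and "convert" is which speaker embedding the decoder receives.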
The interesting bit
The zero-shot claim is the hook: the model is meant to generalize to speakers it never heard during training, using only an autoencoder reconstruction loss rather than adversarial training. The authors also ship a HiFi-GAN alternative to the original WaveNet vocoder, which spares you the molasses-slow, sample-by-sample generation of autoregressive vocoding.
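
For a sense of what "only an autoencoder loss" means in practice, the sketch below follows the objective as the paper describes it: a squared-error self-reconstruction term plus a weighted L1 term that asks the content code of the reconstruction to match the content code of the input. The function names, the flattened-list inputs, and the mean reductions are assumptions made for this example; the repo's training script may combine or weight terms differently.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def autoencoder_loss(mel, mel_recon, code, code_recon, lam=1.0):
    """Self-reconstruction loss plus lam * content-code loss.

    mel, mel_recon   : input spectrogram and its reconstruction with the
                       SAME speaker embedding on both sides (flattened)
    code, code_recon : content code of the input, and content code obtained
                       by re-encoding the reconstruction (flattened)
    lam              : weight on the content term (hypothetical default)
    """
    return mse(mel, mel_recon) + lam * l1(code, code_recon)


# Tiny hand-checkable example.
loss = autoencoder_loss(
    mel=[0.0, 1.0], mel_recon=[0.0, 0.5],
    code=[1.0, 2.0], code_recon=[1.0, 1.0],
)
print(loss)  # 0.125 + 1.0 * 0.5 = 0.625
```

Note what is absent: there is no discriminator and no term that ever sees a converted (source-to-target) output. Training only reconstructs each utterance in its own voice, and conversion falls out of swapping the embedding at inference time.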
Key highlights
- PyTorch implementation with pre-trained weights for the converter, speaker encoder, and vocoder
- Supports both GE2E embeddings (zero-shot) and one-hot vectors (closed speaker set)
- Includes HiFi-GAN v1 weights for faster waveform generation
- Training is considered converged when the reconstruction loss reaches roughly 0.0001, per the README
- Paper accepted at ICML 2019; audio demo available
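
The difference between the two speaker representations listed above is easiest to see side by side. This is a minimal sketch, not the repo's code: the speaker labels are made up, the two-dimensional "embeddings" stand in for real encoder outputs, and averaging per-utterance embeddings into one speaker vector is an assumption about how such an identity is typically built.

```python
def one_hot(speaker, known_speakers):
    """Closed-set identity: one slot per training speaker."""
    if speaker not in known_speakers:
        raise KeyError(f"{speaker!r} was not in the training set")
    return [1.0 if s == speaker else 0.0 for s in known_speakers]


def mean_embedding(utterance_embeddings):
    """Zero-shot identity: average per-utterance embeddings that a
    separately trained speaker encoder would produce."""
    n = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    return [sum(e[i] for e in utterance_embeddings) / n for i in range(dim)]


known = ["spk_a", "spk_b", "spk_c"]          # hypothetical training speakers
print(one_hot("spk_b", known))               # [0.0, 1.0, 0.0]

# An unseen speaker has no one-hot slot...
try:
    one_hot("spk_new", known)
except KeyError as err:
    print("one-hot fails:", err)

# ...but any speaker with a few utterances can get an embedding.
new_speaker = mean_embedding([[0.0, 2.0], [1.0, 4.0]])
print(new_speaker)                           # [0.5, 3.0]
```

That asymmetry is the whole zero-shot story: a one-hot vector can only name speakers that existed at training time, while an embedding can be computed for anyone the speaker encoder can listen to.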
Caveats
- The bundled wav data is “very small” and explicitly for code verification only; you bring your own dataset
- Training/testing metadata formats differ, which is a footgun waiting to happen
- Dependencies include PyTorch ≥0.4.1 and TensorFlow ≥1.3 (the latter only for TensorBoard, but still)
Verdict
Worth a look if you’re researching voice conversion or need a baseline that predates the modern diffusion-heavy stack. Skip it if you want turnkey voice cloning for production; this is research code with 2019 ergonomics.