auspicious3000/autovc

Voice cloning without the adversarial drama

A 2019 voice conversion system that learned to swap speakers using only an autoencoder loss: no GANs, no parallel data, no fuss.

1.1k stars · Python · Image · Video · Audio

What it does

AutoVC converts speech from one speaker to another without needing recordings of the same sentence from both people. Feed it a mel-spectrogram and a target speaker embedding, and it reshapes the voice while preserving the words and prosody. The repo includes pre-trained models, Jupyter notebooks for conversion and vocoding, and a tiny verification dataset.
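
The data flow described above can be sketched in a few lines of NumPy. This is a toy illustration, not the repo's model: the random matrices stand in for trained networks, the function names are hypothetical, and the sizes (80 mel channels, 256-dimensional embeddings, a 32-dimensional content code) are assumptions chosen for the example. Feeding the source speaker's embedding to the encoder is likewise an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS = 80      # mel channels per frame (assumed)
EMB_DIM = 256    # speaker embedding size (assumed)
CODE_DIM = 32    # narrow bottleneck for the content code (assumed)

# Random matrices stand in for trained encoder/decoder weights.
W_enc = rng.standard_normal((N_MELS + EMB_DIM, CODE_DIM)) * 0.01
W_dec = rng.standard_normal((CODE_DIM + EMB_DIM, N_MELS)) * 0.01


def tile(emb, n_frames):
    """Repeat one utterance-level embedding for every frame."""
    return np.tile(emb, (n_frames, 1))


def convert(mel, src_emb, tgt_emb):
    """Encode with the source speaker, decode with the target speaker."""
    n_frames = mel.shape[0]
    # The encoder sees the mel frames plus the SOURCE speaker embedding.
    code = np.concatenate([mel, tile(src_emb, n_frames)], axis=1) @ W_enc
    # The decoder sees the content code plus the TARGET speaker embedding.
    return np.concatenate([code, tile(tgt_emb, n_frames)], axis=1) @ W_dec


mel = rng.standard_normal((120, N_MELS))   # 120 frames of fake features
src_emb = rng.standard_normal(EMB_DIM)
tgt_emb = rng.standard_normal(EMB_DIM)

converted = convert(mel, src_emb, tgt_emb)
print(converted.shape)  # (120, 80)
```

The output has the same shape as the input spectrogram, and only the embedding passed to the decoder decides whose voice it is meant to carry. A vocoder then turns that spectrogram back into a waveform.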

The interesting bit

The zero-shot claim is the hook: the model supposedly generalizes to speakers it never heard during training, using only an autoencoder loss rather than adversarial training. The authors also ship a HiFi-GAN vocoder as an alternative to the original WaveNet vocoder, which spares you WaveNet's molasses-slow, sample-by-sample waveform generation.
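
To make "only an autoencoder loss" concrete, here is a minimal sketch of what such an objective can look like. The function names and the `content_weight` parameter are hypothetical, and the content-code term reflects my reading of the paper rather than anything stated on this page, so treat it as an assumption. The arrays are plain NumPy stand-ins, not outputs of the repo's networks.

```python
import numpy as np


def reconstruction_loss(mel, mel_hat):
    """Mean squared error between the input and the reconstructed spectrogram."""
    return float(np.mean((mel - mel_hat) ** 2))


def content_loss(code, code_hat):
    """Mean absolute difference between original and re-encoded content codes."""
    return float(np.mean(np.abs(code - code_hat)))


def autoencoder_loss(mel, mel_hat, code, code_hat, content_weight=1.0):
    # There is no discriminator term: both parts compare the model's own
    # input and output for the SAME speaker (self-reconstruction).
    return reconstruction_loss(mel, mel_hat) + content_weight * content_loss(code, code_hat)


# Toy tensors: a reconstruction that is off by 0.01 in every bin,
# and a content code that is re-encoded perfectly.
mel = np.ones((4, 80))
mel_hat = mel + 0.01
code = np.zeros((4, 32))

loss = autoencoder_loss(mel, mel_hat, code, code)
print(loss)  # about 1e-4, since 0.01 squared is 0.0001
```

During training the source and target speaker are the same person, so the model only ever learns to rebuild its input. Conversion happens at inference time, when a different speaker's embedding is handed to the decoder.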

Key highlights

  • PyTorch implementation with pre-trained weights for the converter, speaker encoder, and vocoder
  • Supports both GE2E embeddings (zero-shot) and one-hot vectors (closed speaker set)
  • Includes HiFi-GAN v1 weights for faster waveform generation
  • Training converges at reconstruction loss ~0.0001, per the README
  • Paper accepted at ICML 2019; audio demo available
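
The difference between the two speaker representations listed above can be sketched as follows. The helper names are hypothetical, the random vectors stand in for real speaker-encoder outputs, and averaging several utterances before L2-normalising is a common practice assumed here, not a documented step of this repo.

```python
import numpy as np


def one_hot(speaker_index, n_speakers):
    """Closed set: each known speaker owns one slot in the vector."""
    vec = np.zeros(n_speakers, dtype=np.float32)
    vec[speaker_index] = 1.0
    return vec


def utterance_embedding(per_utterance_embs):
    """Open set: average several encoder outputs, then L2-normalise."""
    mean = np.mean(per_utterance_embs, axis=0)
    return mean / np.linalg.norm(mean)


# Closed set: speaker 2 out of 4 known speakers.
print(one_hot(2, 4))  # [0. 0. 1. 0.]

# Open set: five fake 256-dimensional encoder outputs for one unseen speaker.
rng = np.random.default_rng(0)
fake_encoder_outputs = rng.standard_normal((5, 256))
emb = utterance_embedding(fake_encoder_outputs)
print(emb.shape)  # (256,)
```

A one-hot vector has no slot for a speaker outside the training set, which is why the zero-shot setting needs an embedding computed from the new speaker's audio.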

Caveats

  • The bundled wav data is “very small” and explicitly for code verification only, so you bring your own dataset
  • Training/testing metadata formats differ, which is a footgun waiting to happen
  • Dependencies include PyTorch ≥0.4.1 and TensorFlow ≥1.3 (the latter only for TensorBoard, but still)
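
A quick way to see whether an environment meets the version floors quoted above is sketched below, using only the standard library. The helper names are hypothetical, and `torch` and `tensorflow` are assumed to be the installed distribution names. Passing the check says nothing about whether much newer releases still run 2019-era code.

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn '1.13.1+cu117' into (1, 13, 1); keep only the leading numeric part."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, compared numerically."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))   # pad so that '1.3' equals '1.3.0'
    b += (0,) * (width - len(b))
    return a >= b


MINIMUMS = {"torch": "0.4.1", "tensorflow": "1.3"}  # floors quoted above


def check_environment(minimums=MINIMUMS):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    print(check_environment())
```

Comparing tuples of integers avoids the string-comparison trap where "0.10" sorts before "0.4".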

Verdict

Worth a look if you’re researching voice conversion or need a baseline that predates the diffusion/vocoder-heavy modern stack. Skip it if you want turnkey voice cloning for production; this is research code with 2019 ergonomics.
