
jaywalnut310/vits

A text-to-speech model using conditional variational autoencoders and adversarial learning to synthesize natural speech audio from text.

7.9k stars Python Image · Video · Audio
Velocity (7d): +4.3 ★ / day · Trend: steady

VITS is an end-to-end neural text-to-speech system that combines variational inference augmented with normalizing flows and adversarial training. It generates raw audio waveforms directly from text in a single stage, with no separate mel-spectrogram prediction step or external vocoder. A stochastic duration predictor samples phoneme durations instead of fixing them, so the same text can be spoken with different rhythms. Together with sampling over the model's latent variables, this expresses the one-to-many relationship in speech synthesis, where one text input can be spoken with different pitches and rhythms.
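The idea of sampling durations can be illustrated with a toy sketch. This is not the VITS implementation: the real predictor is flow-based, while the code below simply perturbs per-phoneme log-durations with Gaussian noise as a stand-in. The function name `sample_durations`, the parameters `noise_scale` and `length_scale` as used here, and the input values are all hypothetical and chosen for illustration only.

```python
import math
import random


def sample_durations(mean_log_durations, noise_scale=0.8, length_scale=1.0, seed=None):
    """Toy stand-in for a stochastic duration predictor.

    Assumption: Gaussian noise on log-durations replaces the flow-based
    sampling used by the actual model.
    """
    rng = random.Random(seed)
    durations = []
    for mu in mean_log_durations:
        # Draw a log-duration around the predicted mean.
        log_w = mu + noise_scale * rng.gauss(0.0, 1.0)
        # Convert to a whole number of frames, at least one per phoneme.
        frames = math.ceil(math.exp(log_w) * length_scale)
        durations.append(max(1, frames))
    return durations


# Hypothetical mean log-durations for four phonemes.
mean_log_durations = [1.2, 0.7, 1.5, 0.9]

take_a = sample_durations(mean_log_durations, seed=0)
take_b = sample_durations(mean_log_durations, seed=1)
deterministic = sample_durations(mean_log_durations, noise_scale=0.0)

print("take A:", take_a)
print("take B:", take_b)
print("no noise:", deterministic)
```

With `noise_scale=0.0` the sketch collapses to a single fixed rhythm, while different seeds give different frame counts for the same input, which is the one-to-many behavior the paragraph describes.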

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.