jaywalnut310/vits
A text-to-speech model that uses a conditional variational autoencoder with adversarial learning to synthesize natural-sounding speech from text.

VITS is an end-to-end neural text-to-speech system that combines variational inference augmented with normalizing flows and adversarial training. Unlike two-stage pipelines that first predict mel-spectrograms and then convert them to audio with a separate vocoder, it generates raw audio waveforms directly from text in a single model. The model includes a stochastic duration predictor so that the same text can be synthesized with different rhythms. Together with the uncertainty modeled over the latent variables, this expresses the one-to-many relationship in speech synthesis, in which one text can be spoken with different pitches and rhythms.
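
To make the one-to-many idea concrete, here is a minimal toy sketch of how sampling durations can produce different rhythms for the same text. It is not the repository's code: the actual stochastic duration predictor is a learned model, whereas this sketch simply adds Gaussian noise to fixed per-phoneme log-durations. The function names (`sample_durations`, `expand_alignment`), the parameters `noise_scale` and `length_scale`, and the example values are all assumptions chosen for illustration.

```python
import math
import random


def sample_durations(mean_log_durations, noise_scale=0.8, length_scale=1.0, rng=None):
    """Sample an integer frame count for each phoneme (toy stand-in, hypothetical).

    Gaussian noise is added in the log-duration domain, so each call can
    yield a different rhythm for the same input. noise_scale=0 is deterministic.
    """
    rng = rng or random.Random()
    durations = []
    for mu in mean_log_durations:
        log_d = mu + noise_scale * rng.gauss(0.0, 1.0)
        # Round up and keep at least one frame so no phoneme is dropped.
        frames = max(1, math.ceil(math.exp(log_d) * length_scale))
        durations.append(frames)
    return durations


def expand_alignment(phonemes, durations):
    """Repeat each phoneme for its sampled number of frames."""
    frames = []
    for phoneme, count in zip(phonemes, durations):
        frames.extend([phoneme] * count)
    return frames


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]          # illustrative phoneme sequence
    mean_log_durations = [0.5, 1.2, 0.8, 1.5]   # made-up values, not model output
    for seed in (0, 1, 2):
        d = sample_durations(mean_log_durations, rng=random.Random(seed))
        print(seed, d, len(expand_alignment(phonemes, d)))
```

Each seed gives a different set of durations, and therefore a different total length, for the same phoneme sequence. Setting `noise_scale` to 0 removes the variation, and `length_scale` stretches or compresses the overall speaking rate.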