yl4579/StyleTTS2
A text-to-speech model achieving human-level synthesis using style diffusion and adversarial training with large speech language models.

StyleTTS 2 is a deep learning TTS model that treats speaking style as a latent random variable and samples it with a diffusion model, so it can generate a style suited to the input text without requiring reference speech. It uses large pre-trained speech language models (such as WavLM) as discriminators, and combines this adversarial training with differentiable duration modeling so that the whole system can be trained end to end. In the authors' reported listening tests, the model surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset. When trained on LibriTTS, it also supports zero-shot speaker adaptation.
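
To make the idea of sampling a style from noise, conditioned only on text, more concrete, here is a minimal sketch of a diffusion-style sampling loop. It is an illustration under stated assumptions, not the project's code: `toy_denoiser`, `sample_style`, the noise schedule, and the plain Euler update are all hypothetical stand-ins, and the real model would use a trained neural denoiser and its own sampler.

```python
import random


def toy_denoiser(noisy_style, text_embedding, sigma):
    """Hypothetical stand-in for a trained denoiser network.

    A real denoiser would predict the clean style vector from the noisy
    one, the text conditioning, and the noise level. Here the clean style
    is simply assumed to equal the text embedding so the loop is runnable.
    """
    return list(text_embedding)


def sample_style(text_embedding, num_steps=5, sigma_max=1.0, sigma_min=1e-3, seed=0):
    """Sample a style vector from noise using only text conditioning."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    rng = random.Random(seed)

    # Geometric noise schedule from sigma_max down to sigma_min, ending at 0.
    ratio = sigma_min / sigma_max
    sigmas = [sigma_max * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]
    sigmas.append(0.0)

    # Start from pure Gaussian noise: no reference audio is involved.
    style = [rng.gauss(0.0, sigmas[0]) for _ in text_embedding]

    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(style, text_embedding, sigma)
        # Euler step along dx/dsigma = (x - denoised) / sigma.
        style = [
            x + (sigma_next - sigma) * (x - d) / sigma
            for x, d in zip(style, denoised)
        ]
    return style


if __name__ == "__main__":
    # Hypothetical 3-dimensional text embedding.
    print(sample_style([0.5, -1.0, 2.0]))
```

Because the toy denoiser always returns the same target, every seed converges to that target. With a trained denoiser, different noise draws would be expected to yield different plausible styles for the same text.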