Camb-ai/MARS5-TTS
A neural text-to-speech model that generates speech with natural prosody from text and a short reference audio clip.

Velocity (7d): +3.8 ★/day · Trend: steady
MARS5 is a speech synthesis model from CAMB.AI built as a two-stage pipeline that combines autoregressive (AR) and non-autoregressive (NAR) stages. Given input text and a reference audio clip as short as 5 seconds, it generates speech and can handle prosodically challenging content such as sports commentary and anime. The architecture includes coarse speech feature encoding, and the model is available on HuggingFace along with Colab demo notebooks.
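Because the model relies on a short reference clip, it can help to check a clip's length before passing it on. The following is a minimal sketch using only Python's standard `wave` module; it is not part of MARS5. The helper names and the `MIN_REF_SECONDS` threshold are hypothetical, with the 5-second value taken from the description above rather than from any documented limit of the model.

```python
import wave

# Assumption: threshold borrowed from the "as short as 5 seconds" description;
# this is not a documented hard limit of MARS5.
MIN_REF_SECONDS = 5.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str, min_seconds: float = MIN_REF_SECONDS) -> bool:
    """Hypothetical pre-check: is the clip at least `min_seconds` long?"""
    return wav_duration_seconds(path) >= min_seconds
```

A call such as `is_usable_reference("speaker.wav")` (the file name is illustrative) returns `True` only when the clip meets the chosen minimum, so unsuitable clips can be caught before any synthesis is attempted.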