A Chinese speech toolkit that actually ships working models
Parrots wraps ASR and TTS into pip-installable Python with pre-trained voices and emotional fine-tuning.

What it does
Parrots is a Python toolkit for speech recognition and synthesis with a clear focus on Chinese, English, and Japanese. It packages distilwhisper for ASR and GPT-SoVITS for TTS into one-liner initializations, plus a newer IndexTTS2 model that adds emotional control. You can pip-install it, point at a HuggingFace speaker model, and generate audio without training anything yourself.
The interesting bit
The emotional control in IndexTTS2 is unusually granular. You can feed it a separate emotion reference audio, tweak an 8-dimensional emotion vector (happy, angry, sad, scared, disgusted, gloomy, surprised, calm), or let it infer mood from the text itself. There’s even pinyin mixing for precise pronunciation control in Chinese — useful when standard characters produce ambiguous readings.
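To make the emotion vector concrete, here is a minimal sketch of how eight named weights could be packed into a fixed-order list. The helper `make_emotion_vector` is hypothetical and not part of Parrots. The dimension order simply follows the listing above, and the 0 to 1 weight range is an assumption, so check the project README before relying on either.

```python
# Order copied from the listing in the text; the model's real ordering is an assumption.
EMOTIONS = ("happy", "angry", "sad", "scared",
            "disgusted", "gloomy", "surprised", "calm")


def make_emotion_vector(**weights: float) -> list[float]:
    """Build an 8-dimensional emotion vector from named weights.

    Hypothetical helper for illustration only; unspecified emotions default to 0.0.
    """
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    for name, value in weights.items():
        # The [0, 1] range is an assumed convention, not confirmed by the source.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"weight for {name!r} must be in [0, 1], got {value}")
    return [float(weights.get(name, 0.0)) for name in EMOTIONS]


vec = make_emotion_vector(happy=0.6, surprised=0.3)
print(vec)  # [0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
```

Naming the dimensions keeps call sites readable and catches typos early, which matters when a positional list of eight floats is otherwise easy to get wrong.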
Key highlights
- One-line ASR: `SpeechRecognition().recognize_speech_from_file("foo.wav")` returns `{"text": "..."}`
- Pre-trained speaker personas including “singing female anchor” and “game male anchor” voices
- Streaming TTS with configurable chunk size for real-time scenarios
- CLI entry points: `parrots asr file.wav` and `parrots tts "text" out.wav`
- Emotion decoupled from speaker identity: same voice, different moods via `emo_alpha` or reference audio
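The one-line ASR call can be wrapped in a small function that pulls out the transcript. This is a sketch: `transcribe`, `Recognizer`, and `FakeRecognizer` are hypothetical names introduced here, and the stand-in class exists only so the example runs without downloading a model. The only things taken from the project are the method name `recognize_speech_from_file` and the `{"text": ...}` result shape.

```python
from typing import Protocol


class Recognizer(Protocol):
    """Anything exposing the method the text describes."""

    def recognize_speech_from_file(self, path: str) -> dict: ...


def transcribe(recognizer: Recognizer, path: str) -> str:
    """Return the transcript text, or an empty string if the key is missing."""
    result = recognizer.recognize_speech_from_file(path)
    return str(result.get("text", "")).strip()


class FakeRecognizer:
    """Stand-in that mimics the documented return shape (not a real model)."""

    def recognize_speech_from_file(self, path: str) -> dict:
        return {"text": f" transcript of {path} "}


print(transcribe(FakeRecognizer(), "foo.wav"))  # transcript of foo.wav

# With Parrots installed, the call would presumably be (import path assumed):
#   from parrots import SpeechRecognition
#   print(transcribe(SpeechRecognition(), "foo.wav"))
```

Passing the recognizer in as an argument keeps the wrapper testable and avoids loading model weights at import time.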
Caveats
- The README claims “high accuracy” but offers no benchmarks or comparison numbers
- Pinyin control is explicitly noted as not supporting all possible pinyin combinations
- Setup still requires a manual PyTorch install before `pip install parrots`
Verdict
Worth a look if you need Mandarin TTS with emotional range and don’t want to wrestle with model checkpoints yourself. Skip it if you need rigorous accuracy metrics or production SLAs: this is a convenience wrapper, not a research benchmark.