microsoft/SpeechT5
Microsoft's unified-modal speech-text pre-training framework, with models for automatic speech recognition (ASR), text-to-speech (TTS), speech translation, and speech language modeling.

The repository collects several pre-training approaches for spoken language processing: SpeechT5 (encoder-decoder pre-training), Speech2C (ASR with unpaired speech), YiTrans (speech translation), SpeechUT (speech-text bridging), and VALL-E X (cross-lingual neural codec language modeling). For each of these systems it contains the model implementation, evaluation results, and inference instructions.