X-LANCE/SLAM-LLM
A training framework and toolkit for building custom multimodal LLMs that process speech, language, audio, and music.

SLAM-LLM is a deep learning framework for researchers and developers who want to train custom multimodal large language models (MLLMs) on speech, language, audio, and music tasks. It provides training recipes, support for parameter-efficient fine-tuning (PEFT), and high-performance checkpoints for inference. The framework supports multi-task training for automatic speech recognition (ASR) and speech translation, and scales to datasets with hundreds of thousands of hours of speech data.
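To give a rough sense of why PEFT makes fine-tuning cheaper, the sketch below compares trainable parameter counts for one linear layer under full fine-tuning and under a low-rank adapter in the style of LoRA. It is an illustration only: LoRA is assumed here as a representative PEFT method, the layer size and rank are hypothetical, and none of the function names come from SLAM-LLM itself.

```python
# Illustrative arithmetic only; not SLAM-LLM code.
# Assumption: a LoRA-style adapter, which freezes the original weight matrix W
# (d_out x d_in) and trains two small matrices A (rank x d_in) and B (d_out x rank).


def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights when only the low-rank factors A and B are updated."""
    return rank * d_in + d_out * rank


# Hypothetical layer size and adapter rank, chosen for illustration.
d_in = d_out = 4096
rank = 8

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
ratio = lora / full

print(f"full fine-tuning: {full} trainable weights")
print(f"low-rank adapter: {lora} trainable weights")
print(f"fraction trained: {ratio:.4%}")
```

With these assumed sizes, the adapter trains well under 1% of the weights in the layer, which is the kind of saving that makes fine-tuning a large model practical on modest hardware.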