bytedance/SALMONN
A family of multi-modal LLMs from ByteDance and Tsinghua University that process speech, audio, and video through unified architectures.

SALMONN is a suite of multi-modal large language models developed by ByteDance and Tsinghua University that process and understand audio, speech, video, and text in a unified framework. The project includes multiple specialized variants, such as video-SALMONN for audio-visual understanding, ELLSA for streaming full-duplex multi-modal perception, and models for speech quality assessment. Each branch of the repository provides model weights and inference code, and the associated research has been published at major ML venues including ICLR and ICML.