bytedance/SALMONN

A research family of multi-modal LLMs from ByteDance and Tsinghua University that process speech, audio, and video through unified architectures.

Star velocity (7d): +1.4 ★/day · Trend: steady

SALMONN is a suite of multi-modal large language models developed by ByteDance and Tsinghua University that process and understand audio, speech, video, and text in a unified framework. The project includes multiple specialized variants such as video-SALMONN for audio-visual understanding, ELLSA for streaming full-duplex multimodal perception, and speech quality assessment models. Each branch provides model weights and inference code, with research published at major ML venues including ICLR and ICML.
