DAMO-NLP-SG/Video-LLaMA
An audio-visual language model that extends LLaMA-2 to understand video and audio content in response to natural language instructions.

Video-LLaMA is an instruction-tuned multimodal LLM that processes video frames, audio tracks, and text together for video understanding tasks. It uses LLaMA-2 as the language decoder and connects visual and audio encoders to it through a projection mechanism that bridges the modalities. Training used cross-modal pretraining techniques inspired by BLIP-2 and MiniGPT-4 to align visual and audio representations with the language space.
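
The projection idea can be illustrated with a toy sketch. This is not the repository's code: the function names (`linear_project`, `build_llm_input`), the tiny dimensions, the random weights, and the video-audio-text ordering are all assumptions chosen for illustration. It only shows how encoder outputs of different widths might each be mapped by a linear layer to the decoder's embedding width and then placed in one sequence with the text embeddings.

```python
import random


def linear_project(tokens, weight, bias):
    """Apply y = W x + b to every token vector.

    `weight` has one row per output dimension, each row of length d_in.
    """
    return [
        [sum(x * w for x, w in zip(tok, row)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for encoder outputs or learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_llm_input(video_tokens, audio_tokens, text_embeddings, video_proj, audio_proj):
    """Project each modality to the decoder width, then concatenate.

    The ordering (video, audio, text) is an assumption for this sketch.
    """
    video_prefix = linear_project(video_tokens, *video_proj)
    audio_prefix = linear_project(audio_tokens, *audio_proj)
    return video_prefix + audio_prefix + text_embeddings


rng = random.Random(0)
D_VIDEO, D_AUDIO, D_LLM = 6, 4, 8  # toy widths, far smaller than any real model

video_tokens = random_matrix(3, D_VIDEO, rng)    # 3 visual tokens
audio_tokens = random_matrix(2, D_AUDIO, rng)    # 2 audio tokens
text_embeddings = random_matrix(5, D_LLM, rng)   # 5 text tokens, already decoder width

video_proj = (random_matrix(D_LLM, D_VIDEO, rng), [0.0] * D_LLM)
audio_proj = (random_matrix(D_LLM, D_AUDIO, rng), [0.0] * D_LLM)

sequence = build_llm_input(video_tokens, audio_tokens, text_embeddings, video_proj, audio_proj)
print(len(sequence), len(sequence[0]))  # 10 tokens, each of width 8
```

After projection every token has the decoder's width regardless of which encoder produced it, which is what allows a single language decoder to attend over all three modalities in one sequence.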