
DAMO-NLP-SG/Video-LLaMA

An audio-visual language model that extends LLaMA-2 to understand video and audio content in response to natural language instructions.

Velocity (7d): +2.8 ★/day · Trend: steady

Video-LLaMA is an instruction-tuned multimodal model that lets a large language model process video frames, audio tracks, and text together for video understanding tasks. It uses LLaMA-2 as the language decoder and pairs visual and audio encoders with a projection mechanism that maps their outputs into the decoder's input space. Training follows cross-modal pretraining techniques inspired by BLIP-2 and MiniGPT-4, which align the visual and audio representations with the language embedding space.
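To make the projection idea concrete, here is a minimal sketch of how encoder outputs could be mapped to the decoder's embedding size and joined with text embeddings into one input sequence. It is an illustration under stated assumptions, not the repository's code: the function names (`make_projection`, `project`, `build_llm_input`), the toy dimensions, and the random weights are all hypothetical, and the real model uses learned layers on top of pretrained encoders.

```python
import random

# Toy sizes, chosen only for illustration (assumptions, not the model's real dimensions).
VIDEO_DIM, AUDIO_DIM, LLM_DIM = 8, 6, 4


def make_projection(in_dim, out_dim, seed=0):
    """Return a random in_dim x out_dim weight matrix standing in for a learned linear layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weight):
    """Map each feature vector (length in_dim) to the language embedding size (out_dim)."""
    out_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(vec)) for j in range(out_dim)]
        for vec in features
    ]


def build_llm_input(video_feats, audio_feats, text_embeds, w_video, w_audio):
    """Project both modalities, then concatenate them with the text embeddings."""
    video_tokens = project(video_feats, w_video)
    audio_tokens = project(audio_feats, w_audio)
    return video_tokens + audio_tokens + text_embeds


# Hypothetical encoder outputs: 3 video tokens, 2 audio tokens, 5 text embeddings.
rng = random.Random(1)
video_feats = [[rng.random() for _ in range(VIDEO_DIM)] for _ in range(3)]
audio_feats = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(2)]
text_embeds = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

sequence = build_llm_input(
    video_feats,
    audio_feats,
    text_embeds,
    make_projection(VIDEO_DIM, LLM_DIM, seed=2),
    make_projection(AUDIO_DIM, LLM_DIM, seed=3),
)
print(len(sequence), len(sequence[0]))
```

Once every token has the same width as the decoder's embeddings, the language model can attend over video, audio, and text positions in a single sequence. The ordering used here (video, then audio, then text) is just one plausible choice.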
