DAMO-NLP-SG/VideoLLaMA2

A multi-modal LLM that processes video and audio for spatial-temporal reasoning and understanding.

Velocity (7d): +1.8 ★/day · Trend: steady

VideoLLaMA 2 is a video large language model that advances spatial-temporal modeling and audio understanding. It extends LLM capabilities to multi-modal video comprehension by combining visual, audio, and text inputs. The project provides model checkpoints, demo spaces on HuggingFace, and training/inference code for the video-LLM architecture.
