DAMO-NLP-SG/VideoLLaMA3
VideoLLaMA3 is a multimodal LLM designed to understand images and videos via joint visual-language processing.

Velocity (7d): +2.3 ★/day · Trend: steady
VideoLLaMA3 is a frontier multimodal foundation model from DAMO-NLP-SG that takes images and videos together with text as input for understanding tasks. It extends a LLaMA-style language model architecture with visual encoders, so a single model covers both image understanding and video comprehension. The project provides Hugging Face model checkpoints and interactive demos for both modalities.