DAMO-NLP-SG/VideoLLaMA2
A multi-modal LLM that processes video and audio for spatial-temporal reasoning and understanding.

Velocity (7d): +1.8 ★/day · Trend: steady
VideoLLaMA 2 is a video large language model (Video-LLM) focused on spatial-temporal modeling and audio understanding. It extends LLM capabilities to multi-modal video comprehension by combining visual, audio, and text inputs. The project provides model checkpoints, demo Spaces on Hugging Face, and training and inference code for the Video-LLM architecture.