
PKU-YuanGroup/Video-LLaVA

A large vision-language model that processes video inputs by aligning visual representations before projection into the language model space.

Velocity (7d): +3.6 ★ / day · Trend: steady

Video-LLaVA is a multi-modal foundation model designed for video understanding and reasoning. It learns a unified visual representation by aligning image and video features in a shared feature space before projecting them into the language model. The model is instruction-tuned, so it can answer questions and reason about video content. It builds on the LLaVA architecture, extended to handle temporal video information alongside language inputs.
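
The "align before projection" idea in the description can be illustrated with a toy, pure-Python sketch. It is not the repository's code: the function names, the dimensions, and the random stand-in features are all hypothetical, and it simply assumes that the image and video encoders already emit tokens in one shared feature space, so a single projection can map both into the language model's embedding space.

```python
import random


def make_matrix(rows, cols, rng):
    """Random matrix as a list of rows; stands in for learned weights or encoder output."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    """Apply one linear projection (tokens @ weight) to every visual token."""
    columns = list(zip(*weight))
    return [[sum(t * w for t, w in zip(tok, col)) for col in columns] for tok in tokens]


def build_llm_input(visual_tokens, text_embeddings, weight):
    """Project visual tokens into the language-model space and prepend them to the text."""
    return project(visual_tokens, weight) + text_embeddings


rng = random.Random(0)
VISUAL_DIM, LLM_DIM = 8, 16  # hypothetical sizes, chosen only for illustration

# Assumption: both encoders output features in the same (already aligned) space.
image_tokens = make_matrix(4, VISUAL_DIM, rng)      # 4 patch tokens from one image
video_tokens = make_matrix(2 * 4, VISUAL_DIM, rng)  # 2 frames x 4 patch tokens each
shared_projection = make_matrix(VISUAL_DIM, LLM_DIM, rng)
question = make_matrix(5, LLM_DIM, rng)             # 5 text token embeddings

# The same projection serves both modalities because their features are aligned.
image_sequence = build_llm_input(image_tokens, question, shared_projection)
video_sequence = build_llm_input(video_tokens, question, shared_projection)
print(len(image_sequence), len(video_sequence), len(video_sequence[0]))
```

In this sketch a video is just a longer run of visual tokens than an image (frames times patches), which is why one projection layer can handle both once the features share a space.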
