PKU-YuanGroup/Video-LLaVA
A large vision-language model that handles video input by aligning visual representations before projecting them into the language model's embedding space.

Video-LLaVA is a multimodal foundation model for video understanding and reasoning. It learns a unified visual representation by aligning visual features in a shared feature space before projecting them into the language model's embedding space. The model is instruction-tuned, which enables question answering and reasoning over video content. It extends the LLaVA architecture to handle temporal video information alongside language inputs.
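To make the "alignment before projection" idea concrete, here is a minimal toy sketch in plain Python. It is not the repository's code. It assumes that visual tokens from different inputs already live in one aligned feature space, so a single projection matrix can map all of them into the language model's embedding space. Every name and size here (`D_VIS`, `D_LLM`, `PATCHES`, `FRAMES`, `project`, `build_llm_input`) is hypothetical and chosen only for illustration.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random floats (stand-in for real features)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    """Linear projection: map each visual token of size d_vis to size d_llm."""
    d_vis, d_llm = len(weight), len(weight[0])
    return [
        [sum(token[k] * weight[k][j] for k in range(d_vis)) for j in range(d_llm)]
        for token in tokens
    ]


def build_llm_input(visual_tokens, text_embeddings, weight):
    """Project the visual tokens, then prepend them to the text embeddings."""
    return project(visual_tokens, weight) + text_embeddings


rng = random.Random(0)

# Hypothetical sizes, far smaller than any real model.
D_VIS, D_LLM = 8, 16  # visual feature size, language model embedding size
PATCHES = 4           # visual tokens per frame
FRAMES = 8            # frames sampled from a video

# One projection shared by every visual input (assumes pre-aligned features).
shared_projection = make_matrix(D_VIS, D_LLM, rng)

image_tokens = make_matrix(1 * PATCHES, D_VIS, rng)       # a single frame
video_tokens = make_matrix(FRAMES * PATCHES, D_VIS, rng)  # all frames, flattened over time
text_embeddings = make_matrix(5, D_LLM, rng)              # five prompt tokens

image_seq = build_llm_input(image_tokens, text_embeddings, shared_projection)
video_seq = build_llm_input(video_tokens, text_embeddings, shared_projection)

print(len(image_seq), len(video_seq), len(video_seq[0]))
```

In this sketch the video input differs from the image input only in the number of visual tokens it contributes. The temporal dimension is flattened into the token sequence, and the language model would consume both sequences in the same way.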