Is Video-LLaVA open source?

Yes — PKU-YuanGroup/Video-LLaVA is open source, released under the Apache-2.0 license.

What language is Video-LLaVA written in?

PKU-YuanGroup/Video-LLaVA is primarily written in Python.

How popular is Video-LLaVA?

PKU-YuanGroup/Video-LLaVA has 3.5k stars on GitHub.

Where can I find Video-LLaVA?

PKU-YuanGroup/Video-LLaVA is on GitHub at https://github.com/PKU-YuanGroup/Video-LLaVA.

← all repositories

PKU-YuanGroup/Video-LLaVA

One visual encoder lets LLMs parse both memes and movies

Video-LLaVA aligns image and video features into one shared language space, so a 7B LLM can reason across both modalities without paired image-video training data.

★3.5k stars Python Language Models Image · Video · Audio

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

Video-LLaVA is a 7B-parameter vision-language model that answers questions and reasons about both still images and video clips through a single, unified visual backbone. It wraps an LLM with encoders whose outputs are bound to the language feature space, then provides Gradio and CLI interfaces, plus a first-class HuggingFace Transformers integration.

The interesting bit

The trick is “alignment before projection”: image and video features are mapped into a shared representation space before hitting the language model, so one projection layer handles both. The authors note this cross-modal capability emerges despite the training set containing no image-video pairs, and they argue the modalities are complementary rather than competing.

Key highlights

Accepted at EMNLP 2024 with a meta score of 4
Single unified visual representation serves both images and videos simultaneously
Ships with 4-bit CLI inference, Gradio web UI, and HuggingFace Transformers support
LoRA fine-tuning scripts included; community forks have adapted it for cinematic narrative analysis (CinePile)
Demo deployments on Replicate, OpenXLab, and ModelScope

Verdict

A solid starting point for researchers and hackers who want one open-weight model that handles both image and video understanding; pure-image or pure-video specialists may find the 7B footprint overkill if the unified angle isn’t needed.

Frequently asked

What is PKU-YuanGroup/Video-LLaVA?: Video-LLaVA aligns image and video features into one shared language space, so a 7B LLM can reason across both modalities without paired image-video training data.
Is Video-LLaVA open source?: Yes — PKU-YuanGroup/Video-LLaVA is open source, released under the Apache-2.0 license.
What language is Video-LLaVA written in?: PKU-YuanGroup/Video-LLaVA is primarily written in Python.
How popular is Video-LLaVA?: PKU-YuanGroup/Video-LLaVA has 3.5k stars on GitHub.
Where can I find Video-LLaVA?: PKU-YuanGroup/Video-LLaVA is on GitHub at https://github.com/PKU-YuanGroup/Video-LLaVA.