EvolvingLMMs-Lab/LLaVA-OneVision-2

An 8B vision model that treats video like HEVC, not a flipbook

LLaVA-OneVision-2 borrows codec compression logic to stretch video understanding across longer timelines without ballooning token budgets.

★ 1.1k stars | Python | Language Models | Image · Video · Audio | ML Frameworks

What it does

LLaVA-OneVision-2 is an 8B-parameter open multimodal model that handles images, long-form video, and spatial 3D reasoning through one encoder and one training pipeline. The project aims to ship the full stack: model weights, training code, configs, datasets, and logs, with no gated API. The one gap today is the training logs for the 2.x series, which are still marked “Coming soon.”

The interesting bit

The vision encoder steals from video compression. Instead of uniformly sampling a few sparse frames and processing every patch in them (most of which is static background), it keeps I-frames dense and feeds only the motion- and residual-rich patches from P-frames into the transformer. At the same 54-token budget, that covers 18 frames instead of 6: triple the temporal coverage without touching the LLM context window. Image, uniform-frame video, and codec-aligned tokens all share one encoder path with a unified (t, h, w) position scheme, with no modality-specific routers or hidden adapters.
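To make the token arithmetic concrete, here is a minimal Python sketch of codec-style patch selection. It is an illustration under stated assumptions, not the project's encoder: the 3x3 patch grid, the I-frame interval of 9, the synthetic `energy` scores, and the function name `select_codec_tokens` are all hypothetical. It keeps every patch on I-frames, spends the rest of a 54-token budget on the highest-energy P-frame patches, and returns each kept token's (t, h, w) position.

```python
from typing import List, Tuple

Patch = Tuple[int, int, int]  # (t, h, w) position of a kept patch


def select_codec_tokens(
    energy: List[List[List[float]]],  # energy[t][h][w]: hypothetical motion/residual score
    i_frame_every: int,               # assumed I-frame interval (not from the source)
    budget: int,                      # total number of visual tokens allowed
) -> List[Patch]:
    """Keep I-frame patches dense, then fill the budget with top-energy P-frame patches."""
    dense: List[Patch] = []
    candidates: List[Tuple[float, Patch]] = []
    for t, frame in enumerate(energy):
        for h, row in enumerate(frame):
            for w, score in enumerate(row):
                if t % i_frame_every == 0:
                    dense.append((t, h, w))               # I-frame: keep every patch
                else:
                    candidates.append((score, (t, h, w)))  # P-frame: compete on energy
    if len(dense) > budget:
        raise ValueError("budget too small to keep I-frames dense")
    remaining = budget - len(dense)
    # Highest energy first; ties are broken by position so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    sparse = [pos for _, pos in candidates[:remaining]]
    return sorted(dense + sparse)


# Synthetic example: 18 frames on a 3x3 patch grid with made-up energy values.
frames, grid, budget = 18, 3, 54
energy = [
    [[((7 * t + 3 * h + 5 * w) % 11) / 10.0 for w in range(grid)] for h in range(grid)]
    for t in range(frames)
]
tokens = select_codec_tokens(energy, i_frame_every=9, budget=budget)

uniform_frames = budget // (grid * grid)        # frames a dense uniform sampler could afford
frames_touched = len({t for t, _, _ in tokens})  # frames that contribute at least one token
print(len(tokens), uniform_frames, frames_touched)
```

In this toy setup a dense uniform sampler spends the same 54 tokens on 6 frames at 9 patches each, while the selective version draws its tokens from across the 18-frame span. The (t, h, w) tuples double as position ids, which is one plausible way a single position scheme could serve image, uniform-frame, and codec-aligned inputs.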

Key highlights

Codec-aligned encoders: OneVision-Encoder and OneVision-Encoder-Lang support image, uniform-frame video, and HEVC-style codec stream inputs natively.
Four-stage curriculum: bootstraps from LLaVA-OneVision-1.5, then scales through 30s captions, 30–180s instruction tuning, 10–15 minute long-video extension, and final refinement on spatial reasoning and point tracking.
Four released datasets: two new ones, LLaVA-OneVision-2-VideoCaption (dense video captions) and LLaVA-OneVision-2-Spatial (3D-aware spatial reasoning), plus the 1.5-era 85M mid-training corpus and instruction mixture.
Fully open by intent: weights, training code, and configs are published, with the 2.x training logs still pending; the README explicitly contrasts this against “most ‘open’ releases.”
8B Instruct model available now on HuggingFace; 4B variant and training logs for the 2.x series marked “Coming soon.”
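The four-stage curriculum listed above can be written down as plain data. The following Python sketch is only a reading aid: the stage names, the `Stage` class, and the `stages_for_clip` helper are hypothetical, and treating “30s captions” as clips of up to 30 seconds is an assumption. The durations themselves come from the summary above; the final stage has no stated clip length, so it is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str                               # hypothetical label, not from the repository
    focus: str                              # what the stage trains on, per the summary
    clip_seconds: Optional[Tuple[int, int]]  # (min, max) clip length; None if not stated


# The model first bootstraps from LLaVA-OneVision-1.5, then runs these stages in order.
CURRICULUM: List[Stage] = [
    Stage("stage_1", "video captions", (0, 30)),             # assumed: up to 30 s
    Stage("stage_2", "instruction tuning", (30, 180)),
    Stage("stage_3", "long-video extension", (600, 900)),    # 10 to 15 minutes
    Stage("stage_4", "spatial reasoning and point tracking", None),
]


def stages_for_clip(seconds: int) -> List[str]:
    """Return the names of stages whose stated clip range contains this duration."""
    return [
        s.name
        for s in CURRICULUM
        if s.clip_seconds is not None and s.clip_seconds[0] <= seconds <= s.clip_seconds[1]
    ]


print(stages_for_clip(120))
print(stages_for_clip(720))
```

Laid out this way, the progression is easy to see: clip length grows by roughly an order of magnitude across the first three stages before the final stage shifts to spatial tasks.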

Caveats

Training logs for the 2.x models are not yet released (badged “Coming soon”), so the end-to-end reproducibility claim cannot be fully checked yet.
The 4B Instruct model is also pending release; only the 8B is downloadable today.

Verdict

Researchers building on multimodal foundations or anyone tired of black-box “open” models should grab this. If you just need a quick API call for image captioning, the setup overhead won’t pay off.

Frequently asked

What is EvolvingLMMs-Lab/LLaVA-OneVision-2? An 8B-parameter open multimodal model for images, long-form video, and spatial 3D reasoning. It borrows codec compression logic to stretch video understanding across longer timelines without ballooning token budgets.
Is LLaVA-OneVision-2 open source? Yes. EvolvingLMMs-Lab/LLaVA-OneVision-2 is open source, released under the Apache-2.0 license.
What language is LLaVA-OneVision-2 written in? EvolvingLMMs-Lab/LLaVA-OneVision-2 is primarily written in Python.
How popular is LLaVA-OneVision-2? EvolvingLMMs-Lab/LLaVA-OneVision-2 has 1.1k stars on GitHub.
Where can I find LLaVA-OneVision-2? EvolvingLMMs-Lab/LLaVA-OneVision-2 is on GitHub at https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2.