EvolvingLMMs-Lab/LLaVA-OneVision-2
An open-source framework for training multimodal large language models that process vision, text, video, and spatial inputs.

Velocity (7d): +4.0 ★/day · Trend: steady
LLaVA-OneVision-2 is a fully open multimodal training framework for building vision-language models. The repository provides model weights, training code, and datasets (including video captioning and spatial datasets) for developing multimodal LLMs. It uses Qwen3 as the language backbone and supports encoder-codec architectures for visual processing.