Is Vision-R1 open source?

Yes — Osilly/Vision-R1 is an open-source project tracked on heatdrop.

What language is Vision-R1 written in?

Osilly/Vision-R1 is primarily written in Python.

How popular is Vision-R1?

Osilly/Vision-R1 has 1.6k stars on GitHub.

Where can I find Vision-R1?

Osilly/Vision-R1 is on GitHub at https://github.com/Osilly/Vision-R1.

Osilly/Vision-R1

Teaching vision models to think by making them write longer

Vision-R1 applies DeepSeek-R1's reinforcement-learning recipe to multimodal models, with a staged training trick that gradually loosens the leash on reasoning length.

★ 1.6k stars · Python · Language Models

What it does

Vision-R1 is a family of multimodal reasoning models (7B to 72B) built on Qwen2.5-VL. The authors use GRPO (Group Relative Policy Optimization) reinforcement learning to teach the model to generate chain-of-thought reasoning for visual math problems. Training starts from a cold-start dataset bootstrapped with DeepSeek-R1 through a “modality bridging” pipeline, which turns image-question pairs into text descriptions that the text-only R1 can reason about.
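
For readers unfamiliar with GRPO, the key idea is that each sampled answer is scored relative to the other answers drawn for the same prompt, rather than against a learned value model. The snippet below is a minimal sketch of that normalization only, not the repository's training code. The function name, the 0/1 reward values, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 sampled answers to one visual math question:
# reward 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage. A group where every answer earns the same reward yields all-zero advantages, so it contributes no learning signal.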

The interesting bit

The core hack is PTST (Progressive Thinking Suppression Training): the cap on reasoning length is deliberately relaxed in stages (4K → 8K → 16K tokens) while the GRPO group size shrinks (16 → 8 → 4). The tight early cap makes the model shorten its reasoning to “find the right thought process”; later stages let it progressively lengthen that reasoning. The 7B variant reportedly matches or beats 70B+ MLLMs on math benchmarks, though the 32B and 72B versions used additional RL data not available to the smaller model.
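
The staged schedule above can be written down as a small lookup table. This is a sketch under stated assumptions rather than the project's actual configuration: the `Stage` class, `stage_for_step`, and `STEPS_PER_STAGE` are hypothetical names, the number of steps per stage is invented, and 4K/8K/16K are read as 4096/8192/16384 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    max_tokens: int  # cap on generated reasoning length
    group_size: int  # answers sampled per prompt for GRPO


# Stage values follow the description above (4K/8K/16K read as powers of two).
PTST_SCHEDULE = (
    Stage(max_tokens=4096, group_size=16),
    Stage(max_tokens=8192, group_size=8),
    Stage(max_tokens=16384, group_size=4),
)

# Hypothetical: RL steps spent in each stage (not taken from the source).
STEPS_PER_STAGE = 100


def stage_for_step(step, schedule=PTST_SCHEDULE, steps_per_stage=STEPS_PER_STAGE):
    """Return the stage active at an RL step; the last stage is held once reached."""
    if step < 0:
        raise ValueError("step must be non-negative")
    index = min(step // steps_per_stage, len(schedule) - 1)
    return schedule[index]


for step in (0, 100, 250):
    s = stage_for_step(step)
    # Worst-case tokens generated per prompt = length cap * group size
    print(step, s.max_tokens, s.group_size, s.max_tokens * s.group_size)
```

One thing the arithmetic makes visible: the length cap doubles while the group size halves, so the worst-case number of generated tokens per prompt is the same in every stage. Whether that was the authors' motivation is not stated here.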

Key highlights

Accepted at ICLR 2026; weights, datasets, and training code partially released on HuggingFace
Cold-start data generated by piping image captions through DeepSeek-R1 to produce multimodal CoT traces
“Aha moment” examples show the 7B model self-correcting and questioning its own visual interpretations
Training scripts provided for LLaMA-Factory (cold-start) and EasyR1/verl (RL phase)
Authors targeting 8-GPU training for the full pipeline in future work
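
The cold-start data flow in the highlights above (image to text description, description to DeepSeek-R1, reasoning trace back to the image) can be sketched as a three-step function. This is an illustration only: `build_cold_start_example`, the record keys, and the two stand-in callables are hypothetical, and no real model API is called.

```python
from typing import Callable, Dict


def build_cold_start_example(
    image_path: str,
    question: str,
    describe_image: Callable[[str, str], str],
    reason_over_text: Callable[[str], str],
) -> Dict[str, str]:
    """Turn one image-question pair into a multimodal CoT training record."""
    # Step 1: a vision-capable model renders the image as text.
    description = describe_image(image_path, question)
    # Step 2: a text-only reasoner (DeepSeek-R1 in the source) works from that text.
    prompt = f"Image description:\n{description}\n\nQuestion: {question}"
    reasoning = reason_over_text(prompt)
    # Step 3: pair the reasoning trace with the original image, not the description.
    return {"image": image_path, "question": question, "response": reasoning}


# Stand-in callables so the sketch runs without any model access.
def fake_describe(image_path: str, question: str) -> str:
    return "A right triangle with legs 3 and 4."


def fake_reason(prompt: str) -> str:
    return "3^2 + 4^2 = 25, so the hypotenuse is 5."


record = build_cold_start_example(
    "triangle.png", "How long is the hypotenuse?", fake_describe, fake_reason
)
print(record)
```

Passing the two models in as callables keeps the sketch independent of any particular captioning or reasoning backend.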

Caveats

The README still says “datasets, code and weights will be released, stay tuned!” even though several releases have already happened, so the actual release status is hard to read
32B and 72B results use extra RL data, so the scaling story isn’t strictly controlled
Stage 3 of PTST (16K tokens) was skipped for the final model; the dotted line in the diagram is doing more work than the training did

Verdict

Worth watching if you’re trying to replicate R1-style reasoning in multimodal settings, especially since the staged-length training is a trick you can reproduce. Skip if you need a fully packaged, batteries-included training framework: this is still a research artifact with rough edges.
