turningpoint-ai/VisualThinker-R1-Zero
Reinforcement learning post-training for visual reasoning that replicates DeepSeek-R1-Zero's emergent reasoning on a 2B multimodal model.

VisualThinker-R1-Zero applies GRPO-based reinforcement learning to train Qwen2-VL-2B on visual reasoning tasks, with no supervised fine-tuning and no learned reward model. The project demonstrates emergent self-reflection and self-correction in visual reasoning, reproducing both the ‘aha moment’ and the growth in response length observed in DeepSeek-R1-Zero. The result suggests that reasoning capabilities can emerge from pure RL training on vision-centric tasks.
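
To make the GRPO idea above more concrete, here is a minimal sketch of its group-relative advantage step, which is what lets training proceed without a learned reward model or value network: several responses are sampled for the same prompt, each is scored, and every score is normalized against its own group. The `rule_based_reward` function, the `<answer>` tag format, and the reward values (1.0 for correctness, 0.5 for format) are illustrative assumptions, not the project's actual reward code; the use of the population standard deviation is likewise a simplification.

```python
import re
from statistics import mean, pstdev


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward (assumed, not taken from the repo).

    Gives 0.5 if the response contains an <answer>...</answer> block and
    an extra 1.0 if the text inside that block matches the gold answer.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    format_ok = match is not None
    correct = format_ok and (
        match.group(1).strip().lower() == gold_answer.strip().lower()
    )
    return (1.0 if correct else 0.0) + (0.5 if format_ok else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    This is the GRPO-style advantage: A_i = (r_i - mean(r)) / (std(r) + eps).
    Population std is used here for simplicity (an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled responses to one prompt whose gold answer is "left".
responses = [
    "<think>The cup is beside the lamp.</think><answer>left</answer>",
    "<answer>right</answer>",
    "left",  # right content, but missing the assumed answer tags
]
rewards = [rule_based_reward(r, "left") for r in responses]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and are reinforced, while below-average ones are discouraged. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.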