Watch a 3B model teach itself to think for $30
A minimal, reproducible setup that shows small language models can develop self-verification and search abilities through pure reinforcement learning.

What it does
TinyZero reproduces the core idea of DeepSeek R1-Zero, teaching a base language model to reason through reinforcement learning alone, on toy tasks such as countdown and multiplication. It runs on 1–2 GPUs and costs under $30 in compute. The project is a thin layer on top of the veRL library, adding scripts and data preprocessing so that a Qwen2.5 model can start training quickly.
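To make "reinforcement learning alone" concrete, here is a minimal sketch of the kind of rule-based reward a countdown task can use: the model is given a few numbers and a target, and is rewarded only if its final equation uses each number exactly once and evaluates to the target. This is an illustration, not TinyZero's actual code. The function name `countdown_reward`, the `<answer>...</answer>` tag convention, and the score values (0.1 for a well-formed but wrong answer, 1.0 for a correct one) are all assumptions made for the example.

```python
import ast
import operator
import re
from collections import Counter

# Only the four basic arithmetic operators are allowed in an answer.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target,
                     format_score=0.1, full_score=1.0):
    """Hypothetical reward: score values and tag format are assumptions."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer at all
    expr = match.group(1).strip()
    # Every provided number must appear exactly once in the equation.
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    if Counter(used) != Counter(numbers):
        return format_score
    try:
        value = _eval(ast.parse(expr, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return format_score
    return full_score if abs(value - target) < 1e-6 else format_score


print(countdown_reward("Let me try... <answer>(25 - 5) * 3</answer>", [3, 5, 25], 60))  # 1.0
print(countdown_reward("<answer>25 + 5 + 3</answer>", [3, 5, 25], 60))  # 0.1
print(countdown_reward("I give up.", [3, 5, 25], 60))  # 0.0
```

Because the reward only checks the final answer, anything the model writes before it (trial equations, checking a result, backtracking) is shaped purely by whether it leads to a correct equation.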
The interesting bit
The “Aha moment” here is literal: the 3B base model develops self-verification and search strategies on its own, with no supervised fine-tuning and no chain-of-thought examples. The 0.5B model, notably, fails to learn reasoning at all, which suggests a capacity threshold and is a useful finding in itself.
Key highlights
- Reproduces R1-Zero’s emergent reasoning on countdown and multiplication tasks
- 3B model learns self-verification and search; 0.5B model does not (a built-in ablation)
- Single-GPU support for models ≤1.5B, dual-GPU for 3B+
- Includes instruct-model ablation with chat-template data preprocessing
- Full experiment logs public on Weights & Biases
Caveats
- Repository is deprecated and no longer maintained; authors direct users to upstream veRL for new RL experiments
- Out-of-memory (VRAM) errors are reported; gradient checkpointing may be needed
- “Multiplication tasks” are mentioned but only countdown training is documented in the README
Verdict
Worth a look if you want to witness emergent reasoning in a controlled, cheap setup, or if you’re skeptical that R1-Zero’s results scale down. Skip it if you need active maintenance; use veRL directly instead.