← all repositories
Jiayi-Pan/TinyZero

Watch a 3B model teach itself to think for $30

A minimal, reproducible setup that shows small language models can develop self-verification and search abilities through pure reinforcement learning.

13.1k stars Python Language ModelsML Frameworks
TinyZero
Velocity · 7d
+26
★ / day
Trend
steady
star history

What it does TinyZero reproduces DeepSeek R1-Zero’s core trick—teaching a base language model reasoning through reinforcement learning alone—on toy tasks like countdown and multiplication. It runs on 1–2 GPUs and costs under $30 in compute. The project is built as a thin layer atop the veRL library, with scripts and data preprocessing to get a Qwen2.5 model training quickly.

The interesting bit The “Aha moment” here is literal: the 3B base model spontaneously develops self-verification and search strategies without any supervised fine-tuning or chain-of-thought examples. The 0.5B model, notably, fails to learn reasoning at all—suggesting a capacity threshold that is itself a useful finding.

Key highlights

  • Reproduces R1-Zero’s emergent reasoning on countdown and multiplication tasks
  • 3B model learns sophisticated skills; 0.5B model does not (a built-in ablation)
  • Single-GPU support for models ≤1.5B, dual-GPU for 3B+
  • Includes instruct-model ablation with chat-template data preprocessing
  • Full experiment logs public on Weights & Biases

Caveats

  • Repository is deprecated and no longer maintained; authors direct users to upstream veRL for new RL experiments
  • Out-of-VRAM issues reported; gradient checkpointing may be needed
  • “Multiplication tasks” are mentioned but only countdown training is documented in the README

Verdict Worth a look if you want to witness emergent reasoning in a controlled, cheap setup—or if you’re skeptical that R1-Zero’s results scale down. Skip it if you need active maintenance; use veRL directly instead.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.