Open-source o1 rival trained purely with reinforcement learning
DeepSeek-R1 proves you can teach LLMs to reason without spoon-feeding them curated examples first.
What it does DeepSeek-R1 is a 671-billion-parameter mixture-of-experts model that matches OpenAI’s o1 on math, code, and reasoning benchmarks. It comes in two flavors: R1-Zero, trained with reinforcement learning alone, and the full R1, which adds a small “cold-start” data phase to fix R1-Zero’s tendency to ramble, repeat itself, and mix languages mid-thought. The project also releases six smaller distilled variants (1.5B to 70B) built on Llama and Qwen.
The interesting bit R1-Zero is the first open result showing that reasoning capabilities—self-verification, reflection, long chain-of-thought—can emerge purely from RL incentives without supervised fine-tuning as scaffolding. The distillation story is equally notable: a 32B parameter dense model beats o1-mini on several benchmarks, suggesting the big model’s reasoning patterns transfer better than trying to train small models from scratch with RL.
Key highlights
- 671B total params, 37B active per forward pass, 128K context window
- Benchmark table shows R1 topping or matching o1-1217 on MATH-500 (97.3%), AIME 2024 (79.8%), and LiveCodeBench (65.9%)
- Distilled Qwen-32B outperforms o1-mini on AIME 2024 (72.6% vs 63.6%)
- MIT licensed; weights and paper available on HuggingFace
- OpenAI-compatible API and web chat at deepseek.com
Caveats
- Running the full R1 locally requires the DeepSeek-V3 infrastructure; the README warns that standard HuggingFace Transformers support is incomplete
- Distilled models use modified configs and tokenizers—you can’t just drop them into existing Llama/Qwen pipelines blindly
- R1-Zero’s raw output is described as poorly readable and prone to endless repetition; the “interesting” version is R1 with its cold-start fix
Verdict Worth your time if you’re researching RL-driven reasoning or need a locally runnable strong reasoner via the distilled variants. Skip if you’re looking for a plug-and-play drop-in replacement for your existing Llama tooling without reading the setup notes.