Teaching 3B models to Google things for themselves
An open-source RL framework that trains small LLMs to interleave reasoning with live search engine calls.

What it does
Search-R1 is a reinforcement learning training framework built on veRL that teaches language models to alternate between reasoning and calling a search engine. The model learns via standard RL methods (PPO, GRPO, REINFORCE) with rule-based outcome rewards, so no supervised fine-tuning on tool use is required. It supports local retrievers (BM25, dense retrieval with FAISS), online search APIs (Google, Bing, Brave), and models ranging from Llama-3.2-3B to 30B+ parameters.
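To make "rule-based outcome reward" concrete, here is a minimal sketch of an exact-match reward in Python. It is not taken from the Search-R1 code: the `<answer>` tag format, the normalization rules, and the function names (`normalize_answer`, `extract_answer`, `outcome_reward`) are all assumptions chosen for illustration.

```python
import re
import string
from typing import List, Optional


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(rollout: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> span, if any.

    The tag name is an assumption made for this sketch.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def outcome_reward(rollout: str, gold_answers: List[str]) -> float:
    """1.0 if the final answer matches any gold answer after normalization, else 0.0."""
    predicted = extract_answer(rollout)
    if predicted is None:
        return 0.0
    predicted_norm = normalize_answer(predicted)
    return 1.0 if any(predicted_norm == normalize_answer(g) for g in gold_answers) else 0.0
```

The reward depends only on the final answer string, not on how many searches were issued or what the intermediate reasoning looked like, which is what makes it an outcome reward rather than a process reward.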
The interesting bit
The authors show that a raw 3B base model can spontaneously learn when to search and how to incorporate the results, given only question-answer pairs and a reward for correctness. The search engine runs as a separate server, and the LLM calls it over an HTTP API mid-generation, which keeps the architecture cleanly modular.
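The interleaving described above can be sketched as a simple loop: generate until the model emits a search query, pause, fetch passages, append them to the context, and resume. The tag names (`<search>`, `<information>`, `<answer>`), the `rollout` function, and the scripted stand-ins for the model and the search server are all hypothetical; in a real setup `search` would be an HTTP call to the retrieval server.

```python
import re
from typing import Callable, List

# Hypothetical tag format for a search request emitted by the model.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(
    prompt: str,
    generate: Callable[[str], str],
    search: Callable[[str], List[str]],
    max_turns: int = 4,
) -> str:
    """Alternate between model generation and retrieval until an answer appears."""
    context = prompt
    for _ in range(max_turns):
        segment = generate(context)
        context += segment
        if "<answer>" in segment:
            break  # the model committed to a final answer
        match = SEARCH_RE.search(segment)
        if match is None:
            break  # no answer and no search request: nothing more to do
        passages = search(match.group(1).strip())
        # Retrieved text is appended to the context before generation resumes.
        context += "\n<information>" + "\n".join(passages) + "</information>\n"
    return context


def scripted_policy() -> Callable[[str], str]:
    """Stand-in for the LLM: replays two fixed turns regardless of context."""
    turns = iter([
        "<think>I need the capital.</think><search>capital of France</search>",
        "<think>The passage says Paris.</think><answer>Paris</answer>",
    ])
    return lambda context: next(turns)


def toy_search(query: str) -> List[str]:
    """Stand-in for the retrieval server."""
    return [f"Result for '{query}': Paris is the capital of France."]


transcript = rollout("Q: What is the capital of France?\n", scripted_policy(), toy_search)
print(transcript)
```

Because `generate` and `search` are passed in as plain callables, the loop never needs to know which model or which search backend sits behind them.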
Key highlights
- Trains on base models (not instruct-tuned): the papers report results for Llama-3.2-3B and Qwen2.5-3B/7B
- Multi-turn search-and-reasoning emerges from RL alone, per preliminary W&B logs
- Swap retrievers or swap in Google/Bing without touching the training code
- Multi-node training is supported for models with 30B+ parameters
- Two published papers with full experiment logs (v0.1 through v0.3) and HuggingFace model releases
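The "swap retrievers without touching the training code" point follows from putting every backend behind one small interface. The sketch below is an illustration of that idea, not the project's actual API: the `Retriever` protocol, both classes, and the JSON request/response shape used by `HttpRetriever` (`query`, `topk`, `passages`) are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, List, Protocol


class Retriever(Protocol):
    """Anything with a search(query, topk) method can serve as the backend."""

    def search(self, query: str, topk: int = 3) -> List[str]: ...


class KeywordRetriever:
    """Toy local retriever: ranks documents by the number of shared words."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.corpus = list(corpus)

    def search(self, query: str, topk: int = 3) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in self.corpus]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked if score > 0][:topk]


class HttpRetriever:
    """Client for a remote search server; the payload schema is an assumption."""

    def __init__(self, url: str) -> None:
        self.url = url

    def search(self, query: str, topk: int = 3) -> List[str]:
        body = json.dumps({"query": query, "topk": topk}).encode("utf-8")
        request = urllib.request.Request(
            self.url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.loads(response.read().decode("utf-8"))["passages"]


def build_context(retriever: Retriever, question: str) -> str:
    """Caller code depends only on the Retriever interface, not on a backend."""
    return "\n".join(retriever.search(question, topk=2))
```

Switching from the local retriever to a hosted one then means constructing a different object and passing it to `build_context`; the calling code stays the same.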
Caveats
- Requires separate conda environments for training and retrieval, with specific PyTorch/CUDA versions
- Quick-start demo uses Wikipedia + E5 retriever; bringing your own corpus needs manual FAISS indexing
- The README pins vLLM to version 0.6.3 or older, which may conflict with newer setups
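One way to catch the vLLM pin early is a pre-flight check before launching training. This is a minimal sketch using only the standard library; the 0.6.3 ceiling comes from the caveat above, while the helper names (`parse_version`, `vllm_is_compatible`, `installed_vllm_version`) are made up for this example and the version parsing is deliberately simplistic.

```python
from importlib import metadata
from typing import Optional, Tuple

# Ceiling taken from the README note summarized above.
MAX_VLLM = (0, 6, 3)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.6.3.post1' into (0, 6, 3), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def vllm_is_compatible(installed: Optional[str]) -> bool:
    """True only if a version string is given and it is at or below the ceiling."""
    return installed is not None and parse_version(installed) <= MAX_VLLM


def installed_vllm_version() -> Optional[str]:
    """Return the installed vLLM version, or None if the package is absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_vllm_version()
    status = "ok" if vllm_is_compatible(found) else "incompatible or missing"
    print(f"vllm version: {found} ({status})")
```

Running this inside the training environment, rather than the retrieval one, matters because the two conda environments may carry different package sets.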
Verdict
Worth a look if you're researching tool-augmented reasoning or need an open alternative to closed DeepResearch pipelines. Skip it if you just want a drop-in chatbot: this is a research training framework, not a finished product.