From CartPole to GRPO: an RL course that runs the code first
An open curriculum that teaches reinforcement learning by starting with runnable experiments, then using the results to explain the math behind PPO, DPO, and LLM alignment.

What it does
Hands-On Modern RL is a free, practice-first course that tries to bridge a stubborn gap: most RL material is either toy-code tutorials or dense paper derivations, with little in between. The project provides runnable experiments (CartPole, Atari Pong, LLM post-training with DPO/GRPO, even VLM geometry reasoning) paired with explanations that introduce formulas only after you’ve seen the training curves collapse or converge. It is built as a VitePress site with a companion PDF; the site source lives in docs/ and the experiment code in code/.
The interesting bit
The course treats debugging as core curriculum. Training collapse, reward hacking, KL drift, and OOM failures are chapter material, not footnotes. That is unusual for educational content, which typically shows only sanitized success cases. The authors also state explicitly that the content was AI-assisted and not fully reviewed, which is either refreshing honesty or a warning, depending on your tolerance for living dangerously.
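To make one of those failure signals concrete, here is a minimal sketch of a KL drift check. It is not taken from the course: the function names and the 0.05 threshold are assumptions chosen for illustration, and the estimator is the simple sample mean of log-probability differences over actions sampled from the old policy.

```python
def approx_kl(old_logprobs, new_logprobs):
    """Estimate KL(old || new) from log-probs of actions sampled under the old policy.

    Uses the plain sample mean of (log pi_old(a) - log pi_new(a)).
    Individual terms can be negative; only the expectation is non-negative.
    """
    if len(old_logprobs) != len(new_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not old_logprobs:
        raise ValueError("need at least one sample")
    diffs = [o - n for o, n in zip(old_logprobs, new_logprobs)]
    return sum(diffs) / len(diffs)


def kl_drift_alert(old_logprobs, new_logprobs, threshold=0.05):
    """Return True when the estimated KL exceeds the threshold.

    The 0.05 default is a hypothetical value, not a recommendation.
    """
    return approx_kl(old_logprobs, new_logprobs) > threshold


if __name__ == "__main__":
    old = [-1.0, -1.0]
    new = [-1.5, -1.5]  # the new policy assigns lower probability to the sampled actions
    print(approx_kl(old, new), kl_drift_alert(old, new))
```

Logging a value like this every update step is one way a KL drift problem would show up on a training dashboard before the reward curve collapses.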
Key highlights
- Covers classic RL (DQN, REINFORCE, Actor-Critic, PPO) through modern LLM alignment (RLHF, DPO, GRPO, RLVR) and agentic systems
- Includes reproducible labs for Agentic RL (Deep Research-style tool use) and VLM RL (GeoQA geometry reasoning)
- Code maps connect formulas to implementations line-by-line
- Training metric visualizations include failure signals, not just pretty convergences
- Full English translation and PDF builds available as of May 2026
- CC BY-NC-SA 4.0 license
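As an illustration of what a formula-to-code map can look like (this sketch is not from the course repository), take the group-relative advantage used in GRPO, A_i = (r_i - mean(r)) / (std(r) + eps), where r holds the rewards of all completions sampled for one prompt. The function name, the choice of population standard deviation, and the eps value are assumptions; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled group into advantages.

    Maps A_i = (r_i - mean(r)) / (std(r) + eps) term by term.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean_r = statistics.fmean(rewards)   # mean(r): baseline shared by the group
    std_r = statistics.pstdev(rewards)   # std(r): population std (an assumption)
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mean_r) / (std_r + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Identical rewards carry no learning signal: every advantage is zero
    print(group_relative_advantages([2.0, 2.0, 2.0]))
```

The second call shows a failure signal worth watching for: when all samples in a group score the same, the advantages vanish and that prompt contributes nothing to the update.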
Caveats
- The course is actively evolving; chapters marked under construction may contain mistakes
- The authors explicitly warn that AI-assisted creation means factual errors or broken code are possible
- Compute resources are limited; the authors are actively seeking GPU donations to run experiments
- Several roadmap items (Unity embodied RL, Diffusion RL) are not yet delivered as of the latest README
Verdict
Worth bookmarking if you are an ML engineer moving from supervised learning into RL, or an LLM practitioner who wants to understand why your GRPO run is exploding. Less useful if you need a polished, finished reference: this is a living courseware project, not a textbook.