walkinglabs/hands-on-modern-rl

From CartPole to GRPO: an RL course that runs the code first

An open curriculum that teaches reinforcement learning by starting with runnable experiments, then using the results to explain the math behind PPO, DPO, and LLM alignment.

2.7k stars · Python · Learning · Agents · Language Models
Velocity (7d): +46 ★ / day · Trend: steady

What it does

Hands-On Modern RL is a free, practice-first course that tries to bridge a stubborn gap: most RL material is either toy-code tutorials or dense paper derivations, with little in between. The project provides runnable experiments (CartPole, Atari Pong, LLM post-training with DPO/GRPO, even VLM geometry reasoning) paired with explanations that introduce formulas only after you’ve seen the training curves collapse or converge. It is built as a VitePress site with a companion PDF; the site's source lives in docs/ and the runnable code lives in code/.

The interesting bit

The course treats debugging as core curriculum. Training collapse, reward hacking, KL drift, and OOM failures are not footnotes; they are chapter material. That is unusual for educational content, which typically shows sanitized success cases. The authors also explicitly note that the content was AI-assisted and not fully reviewed, which is either refreshing honesty or a warning, depending on your tolerance for living dangerously.
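To make one of those failure signals concrete, here is a minimal sketch of a KL drift check. It is not taken from the course: the function names, the sample log-probabilities, and the 0.05 threshold are all hypothetical, chosen only for illustration. It uses the common sample estimate of KL(old || new), the mean of `logp_old - logp_new` over actions sampled under the old policy.

```python
from statistics import fmean


def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(old || new).

    Both arguments are log-probabilities of the same actions, which are
    assumed to have been sampled under the old policy.
    """
    if len(logp_old) != len(logp_new) or not logp_old:
        raise ValueError("need two non-empty sequences of equal length")
    return fmean([o - n for o, n in zip(logp_old, logp_new)])


def kl_drift_steps(kl_history, threshold=0.05):
    """Return the indices of training steps whose KL estimate exceeds threshold.

    The default threshold is a hypothetical value, not a recommendation.
    """
    return [step for step, kl in enumerate(kl_history) if kl > threshold]


if __name__ == "__main__":
    # Hypothetical per-step KL estimates from a training log.
    history = [0.01, 0.02, 0.2, 0.03]
    print("steps with KL drift:", kl_drift_steps(history))
```

In a real run the two log-probability lists would come from the policy before and after an update; a sustained rise in this estimate is the kind of signal worth plotting next to the reward curve.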

Key highlights

  • Covers classic RL (DQN, REINFORCE, Actor-Critic, PPO) through modern LLM alignment (RLHF, DPO, GRPO, RLVR) and agentic systems
  • Includes reproducible labs for Agentic RL (Deep Research-style tool use) and VLM RL (GeoQA geometry reasoning)
  • Code maps connect formulas to implementations line by line
  • Training metric visualizations include failure signals, not just pretty convergences
  • Full English translation and PDF builds available as of May 2026
  • CC BY-NC-SA 4.0 license

Caveats

  • The course is actively evolving; chapters marked under construction may contain mistakes
  • The authors explicitly warn that AI-assisted creation means factual errors or broken code are possible
  • Compute resources are limited; the authors are actively seeking GPU donations to run experiments
  • Several roadmap items (Unity embodied RL, Diffusion RL) are not yet delivered as of the latest README

Verdict

Worth bookmarking if you are an ML engineer moving from supervised learning into RL, or an LLM practitioner who wants to understand why your GRPO run is exploding. Less useful if you need a polished, finished reference: this is a living courseware project, not a textbook.
