policy-gradient/GRPO-Zero
A minimal implementation of GRPO (Group Relative Policy Optimization) for training large language models with reinforcement learning.

This repository implements DeepSeek R1’s GRPO training algorithm from scratch, with a token-level policy gradient loss and improvements from the DAPO project. It drops the KL divergence term and uses no value estimation network, which reduces GPU memory usage enough to train on a single A40 GPU with 48 GB of VRAM. The implementation depends only on tokenizers and PyTorch, avoiding both transformers and vLLM.
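To make the two core ideas concrete, here is a minimal sketch of group-relative advantages and a token-level policy gradient loss. It is an illustration, not the repository's code: the function names `group_relative_advantages` and `token_level_pg_loss`, the `eps` constant, and the toy numbers are all assumptions. It uses plain Python rather than PyTorch so it runs without dependencies, and it omits ratio clipping and every other training detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each sampled response against its own group.

    The group mean acts as the baseline, so no value network is needed.
    `eps` (an assumed constant) guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_pg_loss(token_logprobs, advantages):
    """Average -advantage * log-prob over every token in the group.

    `token_logprobs` holds one list of per-token log-probabilities per
    response. Each token inherits its response's advantage, and the sum is
    divided by the total token count rather than averaged per response.
    """
    total = 0.0
    n_tokens = 0
    for logps, adv in zip(token_logprobs, advantages):
        total += sum(-adv * lp for lp in logps)
        n_tokens += len(logps)
    return total / n_tokens


# Toy group: four responses to one prompt, two of them rewarded.
rewards = [1.0, 0.0, 0.0, 1.0]
token_logprobs = [[-0.1, -0.2], [-1.0], [-0.5, -0.5, -0.5], [-0.3]]

advantages = group_relative_advantages(rewards)
loss = token_level_pg_loss(token_logprobs, advantages)
```

Dividing by the total token count means longer responses contribute proportionally more terms to the loss, whereas averaging per response first would weight every response equally regardless of length.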