
policy-gradient/GRPO-Zero

A minimal implementation of GRPO (Group Relative Policy Optimization) for training large language models with reinforcement learning.

1.9k stars · Python · Language Models · ML Frameworks
Velocity (7d): +4.4 ★ / day · Trend: steady

This repository implements DeepSeek R1’s GRPO training algorithm from scratch, with support for a token-level policy gradient loss and improvements from the DAPO project. It removes the KL divergence term and the value estimation network to reduce GPU memory usage, which allows training on a single A40 GPU with 48GB VRAM. The implementation depends only on tokenizers and PyTorch, avoiding the transformers and vLLM dependencies.
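To make the two ideas in the summary concrete, here is a minimal, framework-free sketch of a group-relative advantage and a token-level policy gradient loss. It is not taken from the repository: the function names, the `eps` value, the use of the population standard deviation, and the sample numbers are all assumptions for illustration, and the sketch leaves out ratio clipping and the other DAPO refinements.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of answers sampled for the same prompt.

    The group mean acts as the baseline, so no value network is needed.
    (Hypothetical helper; eps and the population std are assumptions.)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_pg_loss(token_logprobs, advantages):
    """Policy gradient loss averaged over all tokens in the group.

    token_logprobs[i] holds the log-probabilities of the tokens in answer i.
    Dividing by the total token count (rather than averaging per answer first)
    gives every token equal weight. No KL penalty term is added.
    """
    total_tokens = sum(len(lp) for lp in token_logprobs)
    weighted = sum(adv * sum(lp) for lp, adv in zip(token_logprobs, advantages))
    return -weighted / total_tokens


# Four sampled answers to one prompt: two rewarded, two not (made-up numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Per-token log-probabilities for each answer (made-up numbers).
token_logprobs = [
    [-0.1, -0.2],
    [-0.5, -0.4, -0.3],
    [-1.0],
    [-0.2, -0.2, -0.1, -0.1],
]
loss = token_level_pg_loss(token_logprobs, advantages)
print(advantages, loss)
```

In a PyTorch training loop the same arithmetic would run on tensors with a padding mask, and the log-probabilities would come from the policy model so that gradients flow through the loss.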
