NVlabs/GDPO
GDPO is a reinforcement learning training method for improving LLMs on math reasoning, code generation, and tool-calling tasks.

Velocity (7d): +2.6 ★/day · Trend: steady
GDPO addresses advantage collapse in Group Relative Policy Optimization (GRPO) when training with multiple rewards. Rather than normalizing the combined reward, it normalizes each reward separately, which preserves the relative differences between rollouts and makes LLM fine-tuning more stable. The implementation integrates with major RL training frameworks, including TRL, VERL, and NeMo-RL, and provides SLURM-free training scripts that run on 8×A100 GPUs in approximately 1 hour.
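
The difference between normalizing a summed reward and normalizing each reward separately can be illustrated with a small, standard-library-only sketch. This is not the repository's code: the function names (`normalize`, `summed_advantages`, `decoupled_advantages`), the `EPS` constant, and the toy reward values are all assumptions made for illustration, and any further steps the actual method applies (such as reward weighting or additional normalization) are omitted.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # hypothetical constant: avoids division by zero when all values match


def normalize(values):
    """Standardize a list of numbers to zero mean and (roughly) unit std."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_advantages(rewards):
    """GRPO-style baseline: sum the rewards per rollout, then normalize once.

    `rewards` is a list with one tuple of reward values per rollout in a group.
    """
    totals = [sum(r) for r in rewards]
    return normalize(totals)


def decoupled_advantages(rewards):
    """Decoupled variant: normalize each reward across the group, then sum."""
    columns = list(zip(*rewards))                  # one column per reward
    normalized = [normalize(col) for col in columns]
    return [sum(parts) for parts in zip(*normalized)]


# Toy group of four rollouts scored by two binary rewards (illustrative values).
group = [(1, 0), (0, 1), (0, 0), (0, 1)]

summed = summed_advantages(group)
decoupled = decoupled_advantages(group)

# Rollouts 0 and 1 earn different rewards but the same total, so the summed
# baseline cannot tell them apart, while the decoupled version can.
print("summed:   ", [round(a, 3) for a in summed])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group, the first two rollouts satisfy different rewards yet have the same total, so summing first assigns them the same advantage. Normalizing per reward keeps them distinct because the rarer reward is scaled by its own group statistics.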