← all repositories

NVlabs/GDPO

GDPO is a reinforcement learning training method for improving LLMs on math reasoning, code generation, and tool-calling tasks.

GDPO
Velocity · 7d
+2.6
★ / day
Trend
steady
star history

GDPO addresses reward advantage collapse in Group Relative Policy Optimization (GRPO) when handling multiple rewards. It decouples reward normalization across individual rewards to preserve relative differences and enable more stable LLM fine-tuning. The implementation integrates with major RL training frameworks including TRL, VERL, and NeMo-RL, providing SLURM-free training scripts that can run on 8xA100 GPUs in approximately 1 hour.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.