chi2liu/ABC-GRPO
ABC-GRPO is a variant of the GRPO reinforcement learning algorithm that uses four independent clipping boundaries to improve stability and generalization when training LLMs such as Qwen3.

The project implements Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric refinement of the standard GRPO reinforcement learning algorithm for LLM training. Standard GRPO has two clipping boundaries, each of which takes effect only for one sign of the advantage; ABC-GRPO replaces them with four independent parameters (ε₁, ε₂, ε₃, ε₄) that provide unconditional bounds across all four quadrants of the advantage space. According to the project, the method maintains higher entropy during training, which helps prevent premature convergence, and evaluations on mathematical reasoning tasks with Qwen3 models show it outperforming standard GRPO.
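
To make the difference between conditional and unconditional clipping concrete, here is a minimal per-token sketch in plain Python. It is not the repository's code: the function names, the parameter names, and in particular the mapping of ε₁ to ε₄ onto the four bounds are assumptions made for illustration. The first function is the familiar PPO-style clipped surrogate that GRPO uses; the second shows one plausible way four independent boundaries could bound the probability ratio regardless of which side of 1 it falls on.

```python
def grpo_term(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Standard clipped surrogate for a single token.

    Because of the outer min, the upper bound only takes effect when
    adv > 0 and the lower bound only when adv < 0.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)


def four_bound_term(
    ratio: float,
    adv: float,
    eps_pos_hi: float,  # assumed: upper bound on the ratio when adv >= 0
    eps_pos_lo: float,  # assumed: lower bound on the ratio when adv >= 0
    eps_neg_lo: float,  # assumed: lower bound on the ratio when adv < 0
    eps_neg_hi: float,  # assumed: upper bound on the ratio when adv < 0
) -> float:
    """Hypothetical four-boundary surrogate for a single token.

    The ratio is always clipped into an interval chosen by the sign of
    the advantage, so every (advantage sign, ratio side) quadrant is bounded.
    """
    if adv >= 0:
        lo, hi = 1.0 - eps_pos_lo, 1.0 + eps_pos_hi
    else:
        lo, hi = 1.0 - eps_neg_lo, 1.0 + eps_neg_hi
    clipped = min(max(ratio, lo), hi)
    return clipped * adv


if __name__ == "__main__":
    # Positive advantage, ratio far below 1: the standard form leaves it unclipped.
    print(grpo_term(0.5, 1.0), four_bound_term(0.5, 1.0, 0.2, 0.2, 0.2, 0.2))
    # Negative advantage, ratio far above 1: same story on the other side.
    print(grpo_term(2.0, -1.0), four_bound_term(2.0, -1.0, 0.2, 0.2, 0.2, 0.2))
```

In the two quadrants that standard clipping already handles (positive advantage with a large ratio, negative advantage with a small ratio), both functions agree when all four values equal the single ε. They differ only in the two quadrants the standard form leaves unbounded. A real training loop would apply the same logic to tensors of token-level ratios and group-normalized advantages.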