lsdefine/simple_GRPO
A minimal GRPO implementation for training LLMs with reinforcement learning to achieve R1-style reasoning capabilities.

This repository provides a simple implementation of Group Relative Policy Optimization (GRPO) for training large language models to exhibit R1-style reasoning behavior. It supports vLLM-accelerated inference, reference models split across GPUs, and memory-efficient training on a single A800 GPU. The codebase is intended for education and for experimenting with RL training pipelines for LLMs.
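For readers new to GRPO, the core idea is that each prompt gets a group of sampled completions, and each completion's reward is normalized against the rest of its group instead of against a learned value function. The sketch below illustrates only that normalization step. It is not taken from this repository: the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made for illustration, and real implementations may differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration; `eps` guards against division by
    zero when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    # Assumption: population standard deviation; some implementations use
    # the sample (unbiased) estimate instead.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all completions score the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so the policy update favors the better samples without requiring a separate critic model.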