A saner API for training LLMs with reinforcement learning
TextRL wraps HuggingFace's TRL library in a single config dataclass and a handful of trainer classes so you can run GRPO, DPO, or KTO without drowning in boilerplate.

What it does
TextRL is a thin layer over HuggingFace TRL that standardizes how you configure and run RLHF-style training. You define a TextRLConfig dataclass, pick a trainer (OnlineTrainer, PreferenceTrainer, RewardModelTrainer), and pass a callable reward function or a preference dataset. It handles PEFT/LoRA, 4-bit quantization, vLLM rollout for GRPO, and distributed training via accelerate without adding its own scaffolding.
The interesting bit
The reward function API is deliberately plain Python: decorate any callable with @reward_fn, compose multiple rewards with weights, or wrap a HuggingFace sentiment classifier in one line. No custom tensor formats, no subclassing gym environments. The v1.0 rewrite removed the old PFRL/gym API entirely; the project is now purely an ergonomic wrapper around TRL.
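To make the "plain Python" idea concrete, here is a minimal, self-contained sketch of decorated reward callables combined by weight. It does not import TextRL: the reward_fn decorator, the weighted_sum helper, and the (prompt, completion) -> float signature are all assumptions made for illustration, and the library's real decorator may look different.

```python
from typing import Callable, Sequence, Tuple

# Assumed signature for this sketch: (prompt, completion) -> score.
RewardFn = Callable[[str, str], float]


def reward_fn(func: RewardFn) -> RewardFn:
    """Hypothetical stand-in for a @reward_fn decorator: it only tags the callable."""
    setattr(func, "is_reward", True)  # marker attribute, illustration only
    return func


@reward_fn
def brevity(prompt: str, completion: str) -> float:
    # Full score up to 20 words, then a linear decay that reaches 0 at 100 words.
    n = len(completion.split())
    return 1.0 if n <= 20 else max(0.0, 1.0 - (n - 20) / 80)


@reward_fn
def polite(prompt: str, completion: str) -> float:
    # Toy heuristic: 1.0 if the completion contains "please", else 0.0.
    return 1.0 if "please" in completion.lower() else 0.0


def weighted_sum(rewards: Sequence[Tuple[RewardFn, float]]) -> RewardFn:
    """Combine several reward callables into one by a weighted sum (hypothetical helper)."""

    def combined(prompt: str, completion: str) -> float:
        return sum(weight * fn(prompt, completion) for fn, weight in rewards)

    return reward_fn(combined)


# Usage: 25% brevity, 75% politeness.
score = weighted_sum([(brevity, 0.25), (polite, 0.75)])
print(score("Can you help?", "Yes, please send the file."))
```

The point of the design is that each reward stays an ordinary function you can unit-test on strings, with no tensors or environments involved until the trainer calls it.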
Key highlights
- One TextRLConfig covers GRPO, RLOO, REINFORCE++, DPO, IPO, KTO, and a dozen other algorithms via TRL’s unified loss_type.
- load_model() returns (policy, tokenizer, ref_model) with optional LoRA, QLoRA, and Flash Attention 2 in a single call.
- vLLM rollout support for GRPO generation, gated behind extra={"use_vllm": True}.
- CLI tools for YAML-driven training, adapter merging, and reward-only evaluation.
- Explicit about what’s not supported: PPO, OnlineDPO, ORPO, SimPO, and other algorithms removed in TRL 0.29+ raise an error with a migration hint.
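The "one config, explicit failures" pattern from the highlights above can be sketched with a plain dataclass. This is not TextRLConfig: the class name SketchConfig, its fields, the algorithm strings, and the error messages are hypothetical. Only the extra={"use_vllm": True} flag, the GRPO-only vLLM restriction, and the docs/migration.md pointer come from the project's description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Illustrative subsets only; the project's real lists are longer.
SUPPORTED = {"grpo", "rloo", "dpo", "ipo", "kto"}
REMOVED = {"ppo", "online_dpo", "orpo", "simpo"}


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a single training config (not TextRLConfig)."""

    model_name: str
    algorithm: str = "grpo"
    extra: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        algo = self.algorithm.lower()
        if algo in REMOVED:
            # Fail early with a pointer to the migration notes instead of a deep stack trace.
            raise ValueError(
                f"'{algo}' is no longer supported; see docs/migration.md for alternatives."
            )
        if algo not in SUPPORTED:
            raise ValueError(f"Unknown algorithm '{algo}'.")
        if self.extra.get("use_vllm") and algo != "grpo":
            # Mirrors the stated restriction: vLLM rollout is GRPO-only.
            raise ValueError("use_vllm is only available with GRPO.")
        self.algorithm = algo  # store the normalized, lowercase name


cfg = SketchConfig("my-model", algorithm="GRPO", extra={"use_vllm": True})
print(cfg.algorithm)
```

Validating in __post_init__ means a bad algorithm name fails when the config is built, before any model weights are loaded.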
Caveats
- The project is basically glue code around TRL; if TRL breaks or removes an algorithm, TextRL breaks too.
- vLLM rollout is GRPO-only; RLOO and REINFORCE++ don’t get the fast path.
- The README notes a “v1.0 breaking change” with legacy API removal; check docs/migration.md if you’re upgrading.
Verdict
Worth a look if you’re already in the TRL ecosystem and want less boilerplate, or if you train enough models that copy-pasting TRL scripts has become tedious. Skip it if you need PPO, SimPO, or deeply custom training loops that TRL itself doesn’t support.