voidful/TextRL

A saner API for training LLMs with reinforcement learning

TextRL wraps HuggingFace's TRL library in a single config dataclass and a handful of trainer classes so you can run GRPO, DPO, or KTO without drowning in boilerplate.

Velocity (7d): +0.3 ★/day · Trend: steady

What it does

TextRL is a thin layer over HuggingFace TRL that standardizes how you configure and run RLHF-style training. You define a TextRLConfig dataclass, pick a trainer (OnlineTrainer, PreferenceTrainer, RewardModelTrainer), and pass a callable reward function or a preference dataset. It handles PEFT/LoRA, 4-bit quantization, vLLM rollout for GRPO, and distributed training via accelerate without adding its own scaffolding.

The interesting bit

The reward function API is deliberately plain Python: decorate any callable with @reward_fn, compose multiple rewards with weights, or wrap a HuggingFace sentiment classifier in one line. No custom tensor formats, no subclassing gym environments. The v1.0 rewrite killed the old PFRL/gym API entirely; this is now purely a TRL ergonomic wrapper.
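To make the "plain Python" point concrete, here is a minimal sketch of weighted reward composition. It does not import TextRL: the `reward_fn` decorator below is a local stand-in, and `compose`, `brevity`, `polite`, and the `(prompts, completions) -> list of floats` signature are all assumptions made for illustration, not the library's actual API.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed shape: a reward maps (prompts, completions) to one float per completion.
RewardFn = Callable[[Sequence[str], Sequence[str]], List[float]]


def reward_fn(func: RewardFn) -> RewardFn:
    """Local stand-in decorator (hypothetical): tags a callable as a reward."""
    func.is_reward = True
    return func


def compose(weighted: Sequence[Tuple[RewardFn, float]]) -> RewardFn:
    """Return a reward that is the weighted sum of the given rewards."""

    @reward_fn
    def combined(prompts: Sequence[str], completions: Sequence[str]) -> List[float]:
        totals = [0.0] * len(completions)
        for fn, weight in weighted:
            for i, score in enumerate(fn(prompts, completions)):
                totals[i] += weight * score
        return totals

    return combined


@reward_fn
def brevity(prompts, completions):
    # 1.0 for an empty completion, falling linearly to 0.0 at 100+ characters.
    return [max(0.0, 1.0 - len(c) / 100) for c in completions]


@reward_fn
def polite(prompts, completions):
    # 1.0 if the completion contains "please" (case-insensitive), else 0.0.
    return [1.0 if "please" in c.lower() else 0.0 for c in completions]


reward = compose([(brevity, 0.5), (polite, 2.0)])
scores = reward(["q1", "q2"], ["Please wait.", "x" * 200])
```

In this sketch the composed reward is itself an ordinary callable, so it could be passed anywhere a single reward function is accepted. Check the TextRL docs for the real decorator signature and composition helper.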

Key highlights

  • One TextRLConfig covers GRPO, RLOO, REINFORCE++, DPO, IPO, KTO, and a dozen other algorithms via TRL’s unified loss_type.
  • load_model() returns (policy, tokenizer, ref_model) with optional LoRA, QLoRA, and Flash Attention 2 in a single call.
  • vLLM rollout support for GRPO generation, gated behind extra={"use_vllm": True}.
  • CLI tools for YAML-driven training, adapter merging, and reward-only evaluation.
  • Explicit about what’s not supported: PPO, OnlineDPO, ORPO, SimPO, and others removed in TRL 0.29+ raise an error with a migration hint.
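The single-config pattern with explicit failure for removed algorithms can be sketched as follows. This is an illustration only, not TextRL's code: `SketchConfig`, its fields other than `extra`, the algorithm name strings, and the error messages are hypothetical. Only the `extra={"use_vllm": True}` gate, the GRPO-only restriction on vLLM, and the pointer to docs/migration.md come from the project description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical name sets, chosen to mirror the algorithms listed above.
SUPPORTED = {"grpo", "rloo", "dpo", "ipo", "kto"}
REMOVED = {"ppo", "online_dpo", "orpo", "simpo"}


@dataclass
class SketchConfig:
    """Stand-in for a one-dataclass training config (not TextRLConfig)."""

    algorithm: str = "grpo"
    learning_rate: float = 1e-6
    extra: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        name = self.algorithm.lower()
        if name in REMOVED:
            # Fail early with a pointer to the migration notes.
            raise ValueError(
                f"{self.algorithm!r} is not supported; see docs/migration.md"
            )
        if name not in SUPPORTED:
            raise ValueError(f"unknown algorithm {self.algorithm!r}")
        if self.extra.get("use_vllm") and name != "grpo":
            # vLLM rollout is described as GRPO-only.
            raise ValueError("use_vllm is only available for GRPO")
        self.algorithm = name


cfg = SketchConfig(algorithm="GRPO", extra={"use_vllm": True})
```

Validating in `__post_init__` means a bad algorithm name or an unsupported vLLM combination fails at construction time, before any model is loaded.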

Caveats

  • The project is basically glue code around TRL; if TRL breaks or removes an algorithm, TextRL breaks too.
  • vLLM rollout is GRPO-only; RLOO and REINFORCE++ don’t get the fast path.
  • The README notes a “v1.0 breaking change” that removes the legacy API; check docs/migration.md if you’re upgrading.

Verdict

Worth a look if you’re already in the TRL ecosystem and want less boilerplate, or if you train enough models that copy-pasting TRL scripts has become tedious. Skip it if you need PPO, SimPO, or deeply custom training loops that TRL itself doesn’t support.
