← all repositories
Gen-Verse/OpenClaw-RL

Your chatbot learns while you argue with it

An async RL framework that turns live conversations into training signals without taking the agent offline.

5.5k stars Python AgentsLLMOps · Eval
OpenClaw-RL
Velocity · 7d
+54
★ / day
Trend
steady
star history

What it does OpenClaw-RL wraps a self-hosted LLM as an OpenAI-compatible API, intercepts your multi-turn conversations, and continuously optimizes the policy in the background. It supports both personal agent tuning and scalable RL for terminal, GUI, SWE, and tool-call agents. The entire stack—policy, judge, trainer—runs on your own hardware.

The interesting bit The architecture is genuinely decoupled: four async loops for serving, rollout collection, PRM/judge evaluation, and policy training, none blocking the others. Most RL-for-LLM systems batch offline; this one learns from the next user message or tool output as a natural “next-state” reward signal while you keep chatting.

Key highlights

  • Three training paradigms: Binary RL (GRPO with scalar rewards), On-Policy Distillation (textual hints as token-level directional signals), and a hybrid combining both
  • Automatic trajectory construction: classifies turns into trainable “main-line” vs. non-trainable “side” conversations, applies majority-vote judging, and feeds ready samples to the trainer
  • Self-hosted and private: no third-party model APIs required; supports local GPU or cloud deployment via Tinker and Fireworks AI
  • LoRA training supported; Qwen3.5 (4B/9B/27B) added recently with multimodal support
  • Built on the Slime training framework; extensible via custom loss, rollout, and reward-model hooks without touching core code

Caveats

  • The “one line of code” cloud launch claim is referenced but not shown in the truncated README; actual setup complexity is unclear
  • Track 2 (general agents) is newer and less proven than the personal-agent track; roadmap shows “support more cloud services” still unchecked
  • Contribution docs suggest the project is still stabilizing conventions (e.g., “do not modify the core framework” implies boundary disputes happen)

Verdict Worth a look if you’re running a local agent and want it to improve from real usage without building a data pipeline. Skip if you need a polished, fully managed RL service—this is infrastructure you operate yourself.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.