A full-stack factory for phone-tapping AI agents
ClawGUI unifies online RL training, standardized benchmarks, and real-device deployment for GUI agents in one modular framework.

What it does
ClawGUI is a research framework that handles the complete lifecycle of GUI agents: training them with online reinforcement learning, evaluating them against standardized benchmarks, and deploying them to control real Android, HarmonyOS, or iOS devices via natural language. It ships as five independent modules—RL, Eval, Agent, Skills, and an on-device App—each with its own environment and documentation.
The interesting bit
For training, the framework replaces standard GRPO with GiGPO plus a process reward model (PRM), which gives fine-grained step-level rewards rather than a single episode-level signal. It also runs the full “brain + agent” stack on a single phone via Shizuku, with no desktop coordinator required, although model inference still goes through cloud APIs. The training-free skill evolution system lets agents diagnose failures, revise structured skill packages, and reuse them across tasks without retraining.
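To make the step-level idea concrete, here is a minimal sketch of how an episode-level, group-relative advantage (the GRPO-style signal) can be combined with a step-level advantage computed over steps that share the same screen state. This is not ClawGUI's code. The function names, the `state_key` strings, the `weight` and `gamma` parameters, and the additive combination are all assumptions for illustration, and the per-step rewards stand in for whatever a PRM would score.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group-relative normalization: subtract the group mean, divide by its std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def step_advantages(trajectories, gamma=0.95):
    """Compute step-level advantages.

    trajectories: list of trajectories, each a list of (state_key, step_reward).
    Steps from different trajectories that share a state_key form one group,
    and their discounted returns are normalized within that group.
    """
    groups = defaultdict(list)  # state_key -> [(traj_idx, step_idx, return)]
    for i, traj in enumerate(trajectories):
        running = 0.0
        returns = [0.0] * len(traj)
        for t in range(len(traj) - 1, -1, -1):  # discounted return-to-go
            running = traj[t][1] + gamma * running
            returns[t] = running
        for t, (state_key, _) in enumerate(traj):
            groups[state_key].append((i, t, returns[t]))

    adv = [[0.0] * len(traj) for traj in trajectories]
    for members in groups.values():
        if len(members) < 2:
            continue  # a state seen only once carries no relative signal
        normed = normalize([ret for _, _, ret in members])
        for (i, t, _), a in zip(members, normed):
            adv[i][t] = a
    return adv


def combined_advantages(trajectories, episode_returns, weight=1.0, gamma=0.95):
    """Episode-level advantage plus a weighted step-level advantage per step."""
    episode_adv = normalize(episode_returns)
    step_adv = step_advantages(trajectories, gamma)
    return [
        [episode_adv[i] + weight * a for a in step_adv[i]]
        for i in range(len(trajectories))
    ]


# Hypothetical rollouts for one task: both start on "home", only one succeeds.
rollouts = [
    [("home", 0.0), ("settings", 1.0)],
    [("home", 0.0), ("browser", 0.0)],
]
advantages = combined_advantages(rollouts, episode_returns=[1.0, 0.0])
```

In this toy case the two rollouts diverge at the shared "home" screen, so that step receives extra credit or blame on top of the episode-level signal, while the later steps, each seen only once, keep the episode-level advantage alone.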
Key highlights
- ClawGUI-RL: Training on parallel Dockerized Android emulators or real devices, with automatic failover and episode visualization
- ClawGUI-Eval: 6 benchmarks, 11+ models, and a 95.8% reproduction rate against official results, so scores are directly comparable
- ClawGUI-Agent: Cross-platform device control through 12+ chat platforms with one-command evaluation (“benchmark qwen3vl on screenspot-pro”)
- ClawGUI-APP: Full phone-only deployment; the brain and phone-agent logic run on the phone, though model inference still calls cloud APIs for now
- ClawGUI-2B: End-to-end validation. A 2B model trained entirely with this pipeline reaches a 17.1 success rate (SR) on MobileWorld vs. 11.1 for the baseline, a 6.0-point gain
Caveats
- Desktop and web online RL extensions are on the roadmap but not yet implemented
- On-device inference still relies on cloud APIs for the brain/VLM; fully local inference is future work
- Each module has its own environment setup and there is no unified install, so expect some assembly
Verdict
Researchers building or benchmarking GUI agents should grab this; it solves the “train in one repo, evaluate in another, deploy in a third” fragmentation problem. If you just need a simple phone automation script, it’s overkill.