An open-source computer-use agent that beat human scores
Agent S3 is the first framework to surpass human-level performance on OSWorld, and it runs on your actual desktop via PyAutoGUI.

What it does
Agent S is an open-source “computer use agent” that drives your real machine (Linux, macOS, or Windows) by looking at the screen and issuing clicks and keystrokes through PyAutoGUI; with local code execution enabled, it can also run Python and Bash. You give it a task in plain English; it plans, executes, and reflects until the job is done or it hits a step limit. The latest iteration, Agent S3, is distributed as the gui-agents PyPI package and can be invoked from either a CLI or a Python SDK.
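To make the "plan, execute, reflect until done or out of steps" loop concrete, here is a minimal sketch in Python. It is not the gui-agents API: `Step`, `run_task`, `plan_step`, and `execute` are hypothetical names invented for illustration, and reflection is modeled simply as feeding the history of past actions and observations back into the planner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One planner decision (hypothetical structure, not from gui-agents)."""
    action: str   # what to do next, e.g. "click the Save button"
    done: bool    # True when the planner believes the task is finished


def run_task(
    task: str,
    plan_step: Callable[[str, List[str]], Step],
    execute: Callable[[str], str],
    max_steps: int = 100,
) -> List[str]:
    """Loop until the planner reports completion or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        # The planner sees the task plus everything that has happened so far.
        step = plan_step(task, history)
        if step.done:
            break
        # Carry out the action and record what was observed afterwards.
        observation = execute(step.action)
        history.append(f"{step.action} -> {observation}")
    return history
```

In a real agent, `plan_step` would call the reasoning model and `execute` would drive the desktop; here both are plain callables so the control flow can be tested in isolation.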
The interesting bit
The project separates “thinking” from “pointing.” A large language model (e.g., GPT-5) handles high-level planning, while a dedicated grounding model—recommended: ByteDance’s UI-TARS-1.5-7B—translates abstract actions into exact screen coordinates. This two-model design is what let Agent S3 score 72.6% on OSWorld with Behavior Best-of-N, nudging past the reported human baseline of ~72%. It also zero-shot transfers to WindowsAgentArena and AndroidWorld without retraining.
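The split between "thinking" and "pointing" can be sketched as two callables with different jobs. This is an illustrative outline only: `AbstractAction`, `GroundedAction`, `decide`, and the two callable signatures are assumptions made for this example, not the project's actual classes or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class AbstractAction:
    """What the planner wants, with no pixel coordinates (hypothetical)."""
    verb: str     # e.g. "click"
    target: str   # natural-language description of the UI element


@dataclass(frozen=True)
class GroundedAction:
    """The same intent, resolved to a screen position (hypothetical)."""
    verb: str
    x: int
    y: int


# Assumed signatures: both models receive the current screenshot as bytes.
Planner = Callable[[str, bytes], AbstractAction]
Grounder = Callable[[str, bytes], Tuple[int, int]]


def decide(task: str, screenshot: bytes,
           planner: Planner, grounder: Grounder) -> GroundedAction:
    # Step 1: the reasoning model picks an action in abstract terms.
    abstract = planner(task, screenshot)
    # Step 2: the grounding model turns the description into coordinates.
    x, y = grounder(abstract.target, screenshot)
    return GroundedAction(abstract.verb, x, y)
```

Keeping the two roles behind separate interfaces is what would let either model be swapped without touching the other.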
Key highlights
- First to beat humans on OSWorld: 72.6% with Behavior Best-of-N, 66% in the standard 100-step setting.
- Pluggable model backends: Supports OpenAI, Anthropic, Gemini, Azure, OpenRouter, and vLLM for the reasoning model.
- Optional local code execution: --enable_local_env lets the agent run Python and Bash locally for data processing or file manipulation. Convenient, but it runs as you.
- Single-monitor constraint: The README explicitly notes the agent is designed for single-monitor setups; multi-monitor is untested or unsupported.
- Academic pedigree: Papers accepted at ICLR 2025 and COLM 2025; S1 won Best Paper at the ICLR Agentic AI for Science Workshop.
Caveats
- Security, not a footnote: The local coding environment executes arbitrary Python and Bash with your user permissions. The README warns twice; sandboxing is strongly advised.
- Grounding model is mandatory: You must bring your own UI-TARS endpoint (or equivalent) and know its coordinate resolution. No grounding model, no agent.
- Single monitor only: Multi-monitor users are out of luck for now.
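One plausible reason the grounding model's coordinate resolution matters: if the model reports positions in its own coordinate space, they have to be rescaled to the real screen before a click lands in the right place. The sketch below shows that arithmetic under this assumption. `scale_point` is a hypothetical helper, not part of gui-agents, and the resolutions used are illustrative numbers rather than values from the README.

```python
from typing import Tuple


def scale_point(
    x: float,
    y: float,
    model_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the grounding model's space to screen pixels."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    if min(model_w, model_h, screen_w, screen_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Scale each axis independently, then round to the nearest pixel.
    sx = round(x * screen_w / model_w)
    sy = round(y * screen_h / model_h)
    # Clamp so the result is always a valid on-screen pixel index.
    return (min(max(sx, 0), screen_w - 1), min(max(sy, 0), screen_h - 1))


# Illustrative values: a 1920x1080 model space mapped onto a 3840x2160 screen.
center = scale_point(960, 540, (1920, 1080), (3840, 2160))
```

With these numbers each axis doubles, so `center` is `(1920, 1080)`. Getting the model-side resolution wrong would shift every click by the same ratio, which is why it needs to be known up front.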
Verdict
Worth a look if you’re building or benchmarking GUI agents and want a reproducible, open stack that actually tops public leaderboards. Skip it if you need a polished consumer product or multi-monitor support; this is research code with sharp edges and a mandatory BYO-model policy.