simular-ai/Agent-S

An open-source computer-use agent that beat human scores

Agent S3 is the first framework to surpass human-level performance on OSWorld, and it runs on your actual desktop via PyAutoGUI.

Star velocity (7d): +19 ★/day · Trend: steady

What it does

Agent S is an open-source “computer use agent” that drives your real machine (Linux, macOS, or Windows) by looking at the screen and issuing clicks and keystrokes through PyAutoGUI; with local code execution enabled, it can also run Python and Bash. You give it a task in plain English; it plans, executes, and reflects until the job is done or it hits a step limit. The latest iteration, Agent S3, is distributed as the gui-agents PyPI package and can be invoked from either a CLI or a Python SDK.
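The plan, execute, reflect cycle with a step cap can be sketched as a short loop. This is an illustration only, not the gui-agents API: `run_task`, `plan`, `execute`, and `reflect` are hypothetical names, and the toy callables stand in for the model calls and the PyAutoGUI actions.

```python
from typing import Callable, List, Optional


def run_task(
    task: str,
    plan: Callable[[str, List[str]], Optional[str]],
    execute: Callable[[str], str],
    reflect: Callable[[str, str], str],
    max_steps: int = 100,
) -> List[str]:
    """Loop until the planner returns None (task done) or max_steps is reached."""
    history: List[str] = []
    for _ in range(max_steps):
        action = plan(task, history)        # decide the next action from the task and past notes
        if action is None:                  # planner signals completion
            break
        observation = execute(action)       # in a real agent: click, type, take a screenshot
        history.append(reflect(action, observation))  # record what happened
    return history


# Toy stand-ins (hypothetical) so the loop can run without any model or GUI.
def demo_plan(task: str, history: List[str]) -> Optional[str]:
    steps = ["open editor", "type text", "save file"]
    return steps[len(history)] if len(history) < len(steps) else None


def demo_execute(action: str) -> str:
    return f"did {action}"


def demo_reflect(action: str, observation: str) -> str:
    return f"{action} -> {observation}"


notes = run_task("write a note", demo_plan, demo_execute, demo_reflect, max_steps=5)
print(notes)
```

Lowering `max_steps` below the number of steps the planner wants cuts the run short, which is the "hits a step limit" case.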

The interesting bit

The project separates “thinking” from “pointing.” A large language model (e.g., GPT-5) handles high-level planning, while a dedicated grounding model (recommended: ByteDance’s UI-TARS-1.5-7B) translates abstract actions into exact screen coordinates. With this two-model design, Agent S3 scores 72.6% on OSWorld using Behavior Best-of-N, nudging past the reported human baseline of ~72%. It also transfers zero-shot to WindowsAgentArena and AndroidWorld.
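The split between "thinking" and "pointing" can be pictured as two interfaces: the planner emits an action that names a UI element in words, and the grounder turns that description into pixel coordinates. The sketch below is a rough illustration under that assumption; `AbstractAction`, `to_concrete`, and the lookup-table grounder are hypothetical and do not reflect Agent S internals.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class AbstractAction:
    verb: str    # what to do, e.g. "click"
    target: str  # natural-language description of the UI element


# A grounder maps an element description to (x, y) screen coordinates.
Grounder = Callable[[str], Tuple[int, int]]


def to_concrete(action: AbstractAction, ground: Grounder) -> Dict[str, object]:
    """Resolve the planner's abstract action into a coordinate-level action."""
    x, y = ground(action.target)
    return {"verb": action.verb, "x": x, "y": y}


# Toy grounder (hypothetical): a lookup table instead of a vision model.
FAKE_SCREEN = {"the Save button": (1180, 42)}


def fake_ground(description: str) -> Tuple[int, int]:
    return FAKE_SCREEN[description]


step = to_concrete(AbstractAction("click", "the Save button"), fake_ground)
print(step)  # {'verb': 'click', 'x': 1180, 'y': 42}
```

Because the planner never sees coordinates, either model can in principle be swapped without touching the other.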

Key highlights

  • First to beat humans on OSWorld: 72.6% with Best-of-N, 66% in the standard 100-step setting.
  • Pluggable model backends: Supports OpenAI, Anthropic, Gemini, Azure, OpenRouter, and vLLM for the reasoning model.
  • Optional local code execution: --enable_local_env lets the agent run Python and Bash locally for data processing or file manipulation—convenient, but it runs as you.
  • Single-monitor constraint: The README explicitly notes the agent is designed for single-monitor setups; multi-monitor is untested or unsupported.
  • Academic pedigree: Papers accepted at ICLR 2025 and COLM 2025; S1 won Best Paper at the ICLR Agentic AI for Science Workshop.

Caveats

  • Security, not a footnote: The local coding environment executes arbitrary Python and Bash with your user permissions. The README warns twice; sandboxing is strongly advised.
  • Grounding model is mandatory: You must bring your own UI-TARS endpoint (or equivalent) and know its coordinate resolution. No grounding model, no agent.
  • Single monitor only: Multi-monitor users are out of luck for now.
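Why the grounding model's coordinate resolution matters: if the model reports points in one resolution and the desktop runs at another, each point has to be rescaled before it is clicked. The sketch below assumes a simple proportional mapping; `scale_point` is a hypothetical helper, not part of Agent S, and the resolutions are example values.

```python
from typing import Tuple


def scale_point(
    x: float,
    y: float,
    grounding_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the grounding model's resolution to the real screen."""
    gw, gh = grounding_size
    sw, sh = screen_size
    return round(x * sw / gw), round(y * sh / gh)


# Assumed example: model outputs coordinates for 1920x1080, screen is 2560x1440.
print(scale_point(960, 540, (1920, 1080), (2560, 1440)))  # (1280, 720)
```

Getting the grounding resolution wrong shifts every click by a constant factor, so the error grows toward the bottom-right of the screen.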

Verdict

Worth a look if you’re building or benchmarking GUI agents and want a reproducible, open stack that actually tops public leaderboards. Skip it if you need a polished consumer product or multi-monitor support; this is research code with sharp edges and a mandatory BYO-model policy.
