bytedance/UI-TARS

ByteDance's open-source agent that clicks, types, and games

A vision-language model trained with reinforcement learning to control GUIs, browsers, and even Minecraft—without brittle DOM scraping.

10.9k stars · Python · Agents · Image · Video · Audio
Velocity (7d): +22 ★/day · Trend: steady

What it does

UI-TARS is a multimodal agent model that looks at your screen and emits actual actions—mouse clicks, drags, keyboard shortcuts, text input—via structured outputs parsed into PyAutoGUI code. It handles desktop (Windows/macOS/Linux), mobile/Android, and browser environments through three prompt templates: COMPUTER_USE, MOBILE_USE, and a stripped-down GROUNDING mode for training/evaluation.
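To make the "structured output" idea concrete, here is a minimal sketch of splitting one model reply into its thought and action parts. It is not the project's parser: the reply format shown (a Thought: line followed by Action: name(key='value')), the ParsedStep class, and the parse_step helper are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class ParsedStep:
    """One model turn, split into reasoning and a structured action (hypothetical type)."""
    thought: str
    action_name: str
    args: dict


# Assumed reply shape: "Thought: ...\nAction: name(key='value', ...)".
# The Thought part is optional so that a bare "Action: ..." reply also parses.
_STEP_RE = re.compile(
    r"(?:Thought:\s*(?P<thought>.*?)\s*)?Action:\s*(?P<action>.*)", re.DOTALL
)
_CALL_RE = re.compile(r"(?P<name>\w+)\((?P<args>.*)\)\s*$", re.DOTALL)
_ARG_RE = re.compile(r"(\w+)\s*=\s*'((?:[^'\\]|\\.)*)'")


def parse_step(text: str) -> ParsedStep:
    """Parse a reply into a ParsedStep, raising ValueError on unexpected shapes."""
    match = _STEP_RE.search(text)
    if match is None:
        raise ValueError("expected an 'Action: ...' section")
    call = _CALL_RE.match(match.group("action").strip())
    if call is None:
        raise ValueError("action is not of the form name(key='value', ...)")
    # Collect single-quoted keyword arguments as raw strings.
    args = dict(_ARG_RE.findall(call.group("args")))
    return ParsedStep(match.group("thought") or "", call.group("name"), args)


step = parse_step("Thought: Click the button\nAction: click(start_box='(100,200)')")
print(step.action_name, step.args)
```

Argument values stay as raw strings here; turning a value such as '(100,200)' into screen pixels is a separate step that depends on the coordinate convention in use.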

The interesting bit

The model “thinks” before acting: a reasoning step trained with reinforcement learning writes explicit Thought: content before each Action:, and the README reports that this consistently improves scores (e.g., the Minecraft 200-task average rises from 0.35 to 0.42 with thought enabled). Coordinate handling is finicky enough that the project ships a separate README just to explain how Qwen2.5-VL’s absolute coordinates map to your screen resolution.
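The coordinate mapping can be sketched as a simple rescale. This assumes the model reports pixel positions in a resized copy of the screenshot, so each axis is scaled by the ratio of screen size to resized-image size. The function name and parameters are hypothetical; the project's own coordinate README is the authority on the exact procedure.

```python
def to_screen_coords(model_x, model_y, resized_size, screen_size):
    """Rescale a point from resized-image pixels to screen pixels.

    resized_size and screen_size are (width, height) tuples. The assumption
    that the model's coordinates live in the resized image is illustrative.
    """
    resized_w, resized_h = resized_size
    screen_w, screen_h = screen_size
    if resized_w <= 0 or resized_h <= 0:
        raise ValueError("resized image dimensions must be positive")
    x = round(model_x / resized_w * screen_w)
    y = round(model_y / resized_h * screen_h)
    # Clamp so that a point on the far edge still lands on a valid pixel.
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    return x, y


print(to_screen_coords(500, 250, (1000, 500), (1920, 1080)))  # (960, 540)
```

Scaling each axis independently matters when the resized image and the screen differ in aspect ratio; a single shared factor would drift clicks along one axis.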

Key highlights

  • Open-weights 7B model on Hugging Face; also a closed desktop app variant
  • Benchmark leader on OSWorld (42.5 vs OpenAI CUA’s 36.4), Windows Agent Arena, Android World, and ScreenSpot-Pro
  • Perfect 100% scores on 13/14 Poki browser games tested (the one miss: cubinko at 0%)
  • Ships a parser (pip install ui-tars) that converts model outputs directly into executable PyAutoGUI code
  • UI-TARS-2 announced (Sept 2025) expanding to code and tool use; this repo still centers on 1.5
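As a rough illustration of the "model output to PyAutoGUI code" step mentioned in the highlights, the sketch below turns an already-parsed action into a PyAutoGUI source string. It does not use the ui-tars package; the action dictionary layout and the action_to_pyautogui helper are assumptions. Only the emitted calls (pyautogui.click, pyautogui.write, pyautogui.hotkey) are real PyAutoGUI functions.

```python
def action_to_pyautogui(action: dict) -> str:
    """Render a parsed action (hypothetical dict layout) as PyAutoGUI source code.

    The code is returned as a string rather than executed, so the caller can
    log or review it before running it against a real desktop.
    """
    name = action["name"]
    if name == "click":
        body = f"pyautogui.click({action['x']}, {action['y']}, button='left')"
    elif name == "type":
        # repr() quotes and escapes the text so it is a valid Python literal.
        body = f"pyautogui.write({action['content']!r}, interval=0.05)"
    elif name == "hotkey":
        keys = ", ".join(repr(key) for key in action["keys"])
        body = f"pyautogui.hotkey({keys})"
    else:
        raise ValueError(f"unsupported action: {name}")
    return "import pyautogui\n" + body


print(action_to_pyautogui({"name": "click", "x": 960, "y": 540}))
```

Generating source text instead of calling PyAutoGUI directly keeps the sketch runnable on machines without a display and leaves room for a confirmation step before anything touches the mouse or keyboard.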

Caveats

  • The repo itself is mostly inference scripts, prompt templates, and a post-processing parser—not the training code or model weights
  • Coordinate handling is explicitly called out as a foot-gun requiring careful resolution factoring
  • Web automation users are redirected to Midscene.js; local desktop users to a separate UI-TARS-desktop repo

Verdict

Worth a look if you’re building autonomous GUI agents and want a drop-in vision model that outputs executable actions rather than API calls. Skip if you need end-to-end training infrastructure or a polished consumer product—these are research artifacts with wrappers.
