microsoft/OmniParser

Microsoft's vision-only UI parser turns screenshots into LLM-ready structure

OmniParser extracts clickable elements from raw screenshots so vision models can actually *do* things on a desktop without peeking at the DOM.

24.9k stars · Jupyter Notebook · Agents · Computer Vision
Velocity (7d): +40 ★/day · Trend: steady

What it does OmniParser takes a screenshot of any GUI and spits back structured, labeled elements — bounding boxes, descriptions, and whether each thing is actually clickable. Feed that into GPT-4V, Claude, or a local Qwen/DeepSeek model and you have a vision-based agent that can point, click, and type without ever touching the underlying HTML or accessibility tree.
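To make "structured, labeled elements" concrete, here is a minimal sketch of how such output might be held in memory and turned into something an LLM can act on. The `UIElement` fields, the normalized box convention, and both helper functions are assumptions made for illustration; they are not OmniParser's actual schema or API.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """One parsed element. Field names are illustrative, not OmniParser's schema."""
    label: int
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized to 0..1
    description: str
    interactable: bool


def to_prompt(elements: list[UIElement]) -> str:
    """Render only the interactable elements as a labeled text list for an LLM."""
    return "\n".join(
        f"[{e.label}] {e.description}" for e in elements if e.interactable
    )


def click_point(element: UIElement, width: int, height: int) -> tuple[int, int]:
    """Map the center of a normalized box to pixel coordinates on the screen."""
    x1, y1, x2, y2 = element.bbox
    return round((x1 + x2) / 2 * width), round((y1 + y2) / 2 * height)


elements = [
    UIElement(0, (0.25, 0.5, 0.75, 1.0), "Save button", True),
    UIElement(1, (0.0, 0.0, 0.1, 0.1), "Static logo", False),
]
prompt = to_prompt(elements)
target = click_point(elements[0], width=800, height=600)
```

The agent loop would then show `prompt` to the model, let it answer with a label, and click at the point computed for that label.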

The interesting bit The “pure vision” angle is the twist: no API hooks, no OS-level automation shims, just pixels in and structured intent out. Microsoft pairs it with OmniTool to drive a Windows 11 VM, but the parser itself is model-agnostic. The V2 release claims 39.5% on ScreenSpot-Pro, a GUI grounding benchmark; whether that translates to real-world reliability is the open question.

Key highlights

  • Two-stage pipeline: YOLO-based icon detection + Florence-2/BLIP-2 captioning for element descriptions
  • V2 checkpoints available on HuggingFace; V1.5 added interactability prediction and small-icon detection
  • OmniTool integration supports OpenAI, Anthropic Computer Use, DeepSeek R1, and Qwen 2.5VL out of the box
  • Gradio demo and Jupyter notebook examples included
  • Trajectory logging (March 2025) for building training datasets from real runs
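
The two-stage pipeline in the first highlight can be sketched as a detector that proposes boxes followed by a captioner that describes each one. This is a shape-of-the-idea sketch only: `parse_screen`, `fake_detect`, and `fake_caption` are hypothetical names, and the stand-in stages return canned values instead of calling the real YOLO or caption models.

```python
from typing import Callable

Box = tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def parse_screen(
    image: object,
    detect: Callable[[object], list[Box]],
    caption: Callable[[object, Box], str],
) -> list[dict]:
    """Stage 1 proposes candidate boxes; stage 2 describes each box."""
    elements = []
    for label, box in enumerate(detect(image)):
        elements.append(
            {"label": label, "bbox": box, "description": caption(image, box)}
        )
    return elements


# Stand-in stages so the sketch runs without any model weights.
def fake_detect(image: object) -> list[Box]:
    return [(10, 10, 50, 30), (60, 10, 120, 30)]


def fake_caption(image: object, box: Box) -> str:
    return f"element at x={box[0]}"


parsed = parse_screen(None, fake_detect, fake_caption)
```

Keeping the two stages as plain callables mirrors the split described above: either stage could be swapped for a different model without touching the other.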

Caveats

  • “Documentation WIP” for the new training-data pipeline — Microsoft’s own words
  • The icon_detect model inherits AGPL from YOLO, while the icon_caption_blip2 and icon_caption_florence caption models are MIT, so check your compliance before shipping
  • Multi-agent orchestration in OmniTool is still being “gradually” added, so it is not fully baked

Verdict Worth a look if you’re building desktop automation or GUI agents and want to decouple from platform-specific accessibility APIs. Skip it if you need battle-tested production reliability today; the rough edges are visible and acknowledged.
