Is OmniParser open source?

Yes — microsoft/OmniParser is open source, released under the CC-BY-4.0 license.

What language is OmniParser written in?

GitHub's language statistics list Jupyter Notebook as the primary language of microsoft/OmniParser.

How popular is OmniParser?

microsoft/OmniParser has 25.2k stars on GitHub, and its star growth is currently accelerating (about +27 stars per day over the last 7 days).

Where can I find OmniParser?

microsoft/OmniParser is on GitHub at https://github.com/microsoft/OmniParser.

microsoft/OmniParser

A DOM-Free Way to Stop LLMs From Misclicking

OmniParser turns raw screenshots into structured, labeled UI elements so vision-language models can finally click what they mean to click.

★ 25.2k stars · Jupyter Notebook · Agents · Computer Vision

Velocity (7d): +27 ★ / day · Trend: accelerating

What it does

OmniParser ingests a screenshot and emits a structured breakdown of interface elements (icons, buttons, regions) plus plain-language descriptions of what each does. This gives vision-language models like GPT-4V a concrete coordinate map to ground their actions instead of guessing where to click. The project also ships as OmniTool, a bundled Windows 11 VM controller that wires the parser directly to several major LLMs.
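To make the "coordinate map" idea concrete, here is a minimal sketch of how a parsed element list might be consumed. The `UIElement` class, its field names, and the normalized-bbox convention are assumptions made for illustration; they are not OmniParser's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """One parsed element. Field names are hypothetical, not OmniParser's schema."""
    element_id: int
    kind: str          # e.g. "icon" or "text"
    description: str   # plain-language caption of what the element does
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized to 0..1
    interactable: bool = True


def click_point(element: UIElement, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Return the pixel at the centre of the element's bounding box."""
    x1, y1, x2, y2 = element.bbox
    cx = (x1 + x2) / 2 * screen_w
    cy = (y1 + y2) / 2 * screen_h
    return round(cx), round(cy)


def describe_for_prompt(elements: list[UIElement]) -> str:
    """Render elements as numbered lines that an LLM can refer to by id."""
    return "\n".join(f"[{e.element_id}] {e.kind}: {e.description}" for e in elements)


# Illustrative data only; a real run would come from the parser.
elements = [
    UIElement(0, "icon", "Settings gear", (0.875, 0.0, 1.0, 0.125)),
    UIElement(1, "text", "Submit", (0.25, 0.5, 0.75, 0.75)),
]

print(describe_for_prompt(elements))
print(click_point(elements[1], 1920, 1080))
```

The model picks an element by id from the text listing, and the agent converts that choice into a pixel coordinate, so the model never has to estimate coordinates itself.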

The interesting bit

The pipeline is purely visual: it treats the screen as an image and relies on computer-vision models (a YOLO-derived detector and a Florence captioner) rather than scraping accessibility trees or HTML. That means it can, in theory, operate on any GUI it can screenshot, not just web pages.
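The detect-then-caption structure described above can be sketched as two pluggable stages. Everything here is hypothetical: `parse_screen`, the `Detector` and `Captioner` type aliases, and the stub functions stand in for the real models and are not taken from the repository.

```python
from typing import Callable

Box = tuple[int, int, int, int]            # (x1, y1, x2, y2) in pixels
Detector = Callable[[object], list[Box]]   # image -> candidate element boxes
Captioner = Callable[[object, Box], str]   # (image, box) -> short description


def parse_screen(image: object, detect: Detector, caption: Captioner) -> list[dict]:
    """Detect boxes, then caption each one; no DOM or accessibility tree is consulted."""
    parsed = []
    for index, box in enumerate(detect(image)):
        parsed.append({"id": index, "bbox": box, "description": caption(image, box)})
    return parsed


# Stand-in stubs so the sketch runs without any model weights.
def fake_detect(image: object) -> list[Box]:
    return [(10, 10, 42, 42), (100, 200, 180, 230)]


def fake_caption(image: object, box: Box) -> str:
    x1, y1, x2, y2 = box
    return f"element {x2 - x1}x{y2 - y1} px"


result = parse_screen(image=None, detect=fake_detect, caption=fake_caption)
print(result)
```

Because both stages take only the image as input, the same function shape works for a browser, a desktop app, or a remote VM, which is the portability the paragraph above points to.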

Key highlights

V2 reports 39.5% on the ScreenSpot-Pro grounding benchmark; per the README, the result was announced before the V2 checkpoints were available, and the release followed the next month.
V1.5 added fine-grained small-icon detection and predicts whether each detected element is actually interactable.
OmniTool supports plugging in OpenAI, DeepSeek R1, Qwen 2.5VL, or Anthropic Computer Use as the reasoning backend.
The icon detection model inherits an AGPL license from its YOLO roots, while the caption models are MIT—a split license that matters for commercial redistribution.
Recent updates add local trajectory logging for building training datasets, though the documentation is marked work-in-progress.
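As a rough illustration of what local trajectory logging for dataset building could look like, the sketch below appends one JSON record per agent step to a JSON Lines file. The `log_step` and `load_trajectory` helpers and the record fields are assumptions for illustration; the README's actual log format is not described here.

```python
import json
import time
from pathlib import Path


def log_step(log_path: Path, screenshot_file: str, action: dict) -> None:
    """Append one agent step as a single JSON line (hypothetical record layout)."""
    record = {
        "timestamp": time.time(),
        "screenshot": screenshot_file,  # path to the saved screenshot for this step
        "action": action,               # e.g. {"type": "click", "element_id": 1}
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def load_trajectory(log_path: Path) -> list[dict]:
    """Read the logged steps back in order, skipping blank lines."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

An append-only, line-per-step file survives a crashed run with all completed steps intact, which is convenient when the logs are later filtered into training examples.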

Caveats

Documentation for the new training-pipeline features is still WIP.
Multi-agent orchestration is described as being “gradually added,” so expect scaffolding rather than a finished system.
License mixing (AGPL detector + MIT captioner) complicates redistribution if you ship the full stack.

Verdict

Worth a look if you’re building computer-use agents and want a vision-only grounding layer without DOM dependencies. Skip it if you need a mature, fully documented orchestration framework today.
