Is GPT-4V-Act open source?

Yes — ddupont808/GPT-4V-Act is an open-source project tracked on heatdrop.

What language is GPT-4V-Act written in?

ddupont808/GPT-4V-Act is primarily written in JavaScript.

How popular is GPT-4V-Act?

ddupont808/GPT-4V-Act has 1.1k stars on GitHub.

Where can I find GPT-4V-Act?

ddupont808/GPT-4V-Act is on GitHub at https://github.com/ddupont808/GPT-4V-Act.


ddupont808/GPT-4V-Act

This GPT-4V agent drives Chromium by reading numbered UI stickers

It turns vague visual reasoning into exact mouse clicks by numbering every button on the page.

★ 1.1k stars · JavaScript · Agents




What it does

GPT-4V-Act is an experimental Chromium agent that lets GPT-4V(ision) operate a browser through screenshots alone. A JavaScript auto-labeler tags every interactive DOM element with a unique numeric ID, then feeds the annotated screenshot to the model. GPT-4V replies with structured JSON, for example an instruction to click element 7 or to type into element 4, and the system translates that reply into real mouse and keyboard events. The stated goals are workflow automation, UI testing, and accessibility assistance.
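
The last step of that loop, turning a JSON reply into input events, can be sketched roughly as follows. This is an illustration in Python, not the project's code: the reply fields ("action", "element", "text"), the LabeledElement class, and the event tuples are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledElement:
    """One interactive element as tagged by the labeler (hypothetical shape)."""
    label: int
    x: float       # left edge, page pixels
    y: float       # top edge, page pixels
    width: float
    height: float

    def center(self) -> tuple:
        return (self.x + self.width / 2, self.y + self.height / 2)


def plan_input_events(reply: str, elements: dict) -> list:
    """Turn one JSON reply from the model into a list of low-level input events.

    The schema {"action": ..., "element": ..., "text": ...} is assumed for
    illustration; the real project's JSON format may differ.
    """
    action = json.loads(reply)
    kind = action.get("action")
    if kind not in ("click", "type"):
        raise ValueError(f"unsupported action: {kind!r}")

    target = elements.get(action.get("element"))
    if target is None:
        raise ValueError(f"reply names an unknown label: {action.get('element')!r}")

    # The model only ever names a label; coordinates come from the labeler.
    cx, cy = target.center()
    events = [("mouse_click", cx, cy)]  # a click also focuses a text field
    if kind == "type":
        events.extend(("key_press", ch) for ch in action.get("text", ""))
    return events
```

With elements 4 and 7 registered, a reply such as {"action": "click", "element": 7} resolves to a single click at the center of element 7's bounding box, and a "type" reply adds one key press per character after the focusing click.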

The interesting bit

The project uses “Set-of-Mark Prompting” to turn a messy visual task into a tidy multiple-choice problem. Rather than asking a vision model to guess pixel coordinates from a raw screenshot, it slaps temporary numeric stickers on every button and field. That lets GPT-4V reason about discrete labels instead of continuous space, which is a far more reliable way to make a generalist model behave like a precision input device.
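
A minimal sketch of why discrete labels help: once every element carries a number, the model's answer can be checked against a finite set before anything is clicked. The reading-order numbering and the function names below are assumptions for illustration, not the project's actual heuristic.

```python
# Each box is (x, y, width, height) in page pixels.

def assign_marks(boxes: list) -> dict:
    """Number boxes in reading order: top to bottom, then left to right.

    The ordering rule is an assumption; any stable numbering would do.
    """
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def parse_choice(reply: str, marks: dict) -> int:
    """Accept a reply only if it names one of the marks currently on screen."""
    text = reply.strip()
    if not text.isdecimal() or int(text) not in marks:
        raise ValueError(f"{reply!r} is not one of the {len(marks)} marks on screen")
    return int(text)


marks = assign_marks([(200, 10, 50, 20), (10, 10, 50, 20), (10, 100, 50, 20)])
choice = parse_choice(" 2 ", marks)
print(choice, marks[choice])
```

A guessed pixel coordinate can be slightly wrong without any way to detect it, whereas a label is either in the set or rejected outright.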

Key highlights

Labels interactable elements via a JS DOM heuristic and can export the annotations in COCO format.
Translates GPT-4V’s JSON output (ClickAction, TypeAction, etc.) into actual browser input.
Explicitly grounded in the Set-of-Mark Prompting research paper for coordinate-free UI interaction.
The author now points to Microsoft’s Windows Agent Arena as the successor incorporating the full planned feature set, including AI labeling for desktop UIs.
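
For readers unfamiliar with COCO, the first highlight amounts to writing the labeled bounding boxes into the standard COCO detection layout (images, annotations, categories, with bbox given as [x, y, width, height]). The sketch below shows that layout under assumptions: the to_coco function, the single "interactive_element" category, and the file name are hypothetical, and the project's own export may differ in detail.

```python
import json


def to_coco(file_name: str, width: int, height: int, boxes: list) -> dict:
    """Pack one screenshot's element boxes into a COCO-style detection dict.

    boxes: iterable of (x, y, w, h) in pixels, origin at the top-left corner.
    """
    coco = {
        "images": [{"id": 1, "file_name": file_name, "width": width, "height": height}],
        # One catch-all category; the name is an assumption for this example.
        "categories": [{"id": 1, "name": "interactive_element"}],
        "annotations": [],
    }
    for ann_id, (x, y, w, h) in enumerate(boxes, start=1):
        coco["annotations"].append({
            "id": ann_id,
            "image_id": 1,
            "category_id": 1,
            "bbox": [x, y, w, h],   # COCO convention: top-left corner plus size
            "area": w * h,
            "iscrowd": 0,
        })
    return coco


dataset = to_coco("page.png", 1280, 720, [(10, 20, 30, 40), (100, 200, 80, 24)])
print(json.dumps(dataset["annotations"][0]))
```

Exporting in this shape means the screenshots can be fed to existing object-detection tooling that already reads COCO annotations.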

Caveats

Feature coverage is spotty: scrolling, special keycodes (Enter, Page Up, etc.), and an AI-powered labeler are all listed as not yet implemented.
The agent cannot ask the user for clarification or retain task-relevant memory across steps.
A September 2024 update from the author suggests the project’s forward-looking roadmap has been absorbed by Microsoft’s Windows Agent Arena.

Verdict

Researchers and tinkerers studying vision-based UI automation should treat this as a concise, open-source reference for Set-of-Mark browser control. If you need a robust, production-ready agent today, the author would likely tell you to look at Windows Agent Arena instead.
