ddupont808/GPT-4V-Act
Browser-based AI agent that uses GPT-4V's vision capabilities to autonomously control mouse and keyboard for web UI interaction and automation tasks.

GPT-4V-Act combines GPT-4V(ision) with a web browser to create an autonomous agent capable of interacting with web interfaces through low-level mouse and keyboard controls. The system employs Set-of-Mark Prompting with an auto-labeler that assigns numerical IDs to interactive UI elements, allowing the model to identify and target specific elements from screenshots. Given a task and screenshot input, the agent determines the next action and executes it using precise pixel coordinates, enabling workflow automation and automated UI testing.
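
The labeling and targeting idea described above can be illustrated with a minimal sketch. It is not the project's actual code: the `Element` type, the `label_elements` and `resolve_click` helpers, and the shape of the `action` dictionary are all hypothetical, chosen only to show how a numeric ID picked by the model could be mapped back to pixel coordinates for a click.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical interactive UI element with its on-screen bounding box (pixels)."""
    tag: str
    x: float
    y: float
    width: float
    height: float


def label_elements(elements):
    """Assign sequential numeric IDs (marks), starting at 1, to each element."""
    return {mark: el for mark, el in enumerate(elements, start=1)}


def resolve_click(marks, action):
    """Map the mark chosen by the model to the center pixel of that element."""
    el = marks[action["element"]]
    return (round(el.x + el.width / 2), round(el.y + el.height / 2))


# Two elements as an auto-labeler might report them (values are made up).
elements = [
    Element("input", 100, 40, 300, 30),
    Element("button", 420, 40, 80, 30),
]
marks = label_elements(elements)

# Assumed model output format: an action name plus the numeric mark to target.
action = {"action": "click", "element": 2}
target = resolve_click(marks, action)
print(target)  # (460, 55)
```

In this sketch the model only ever sees and returns small integers, while the agent keeps the mapping from each integer to a bounding box, so the click position is computed by the agent and not estimated by the model.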