
niuzaisheng/ScreenAgent

A Visual Language Model agent that autonomously controls computers by observing screenshots and executing mouse/keyboard actions.

602 stars · Python · Agents · Computer Vision
Velocity (7d): +0.7 ★/day · Trend: steady

ScreenAgent is a framework that lets Visual Language Models interact with real computer screens. The agent observes screenshots, breaks a task into subtasks, executes mouse and keyboard operations, and reflects on the results. This planning-execution-reflection loop runs continuously to guide the agent through multi-step desktop tasks. The project includes a dataset of screenshots and action sequences for training and evaluation.
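The planning-execution-reflection loop described above can be sketched roughly as follows. This is an illustrative outline only, not ScreenAgent's actual code or API: `Action`, `FakeScreen`, `run_task`, the three callbacks (`plan`, `act`, `reflect`), and the verdict strings `"done"`, `"retry"`, and `"replan"` are all hypothetical names chosen for this example. A real agent would replace the callbacks with calls to a Visual Language Model and `FakeScreen` with a real desktop connection.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """A single mouse or keyboard operation (hypothetical shape)."""
    kind: str                      # e.g. "click" or "type"
    payload: dict = field(default_factory=dict)


class FakeScreen:
    """Stand-in for a real desktop: records actions, returns a text 'screenshot'."""

    def __init__(self) -> None:
        self.log: List[Action] = []

    def screenshot(self) -> str:
        return f"screen after {len(self.log)} actions"

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_task(
    task: str,
    screen: FakeScreen,
    plan: Callable[[str, str], List[str]],
    act: Callable[[str, str], List[Action]],
    reflect: Callable[[str, str], str],
    max_steps: int = 10,
) -> bool:
    """Run plan -> execute -> reflect until all subtasks finish or the step budget runs out."""
    subtasks = plan(task, screen.screenshot())        # planning phase
    index, steps = 0, 0
    while index < len(subtasks) and steps < max_steps:
        current = subtasks[index]
        for action in act(current, screen.screenshot()):   # execution phase
            screen.execute(action)
        steps += 1
        verdict = reflect(current, screen.screenshot())    # reflection phase
        if verdict == "done":
            index += 1                                 # move on to the next subtask
        elif verdict == "replan":
            subtasks = plan(task, screen.screenshot())
            index = 0                                  # start over with a fresh plan
        # any other verdict (e.g. "retry") repeats the same subtask
    return index == len(subtasks)


# Stub callbacks standing in for model calls.
def demo_plan(task: str, shot: str) -> List[str]:
    return ["open editor", "type greeting"]


def demo_act(subtask: str, shot: str) -> List[Action]:
    if subtask == "open editor":
        return [Action("click", {"x": 10, "y": 20})]
    return [Action("type", {"text": "hello"})]


def demo_reflect(subtask: str, shot: str) -> str:
    return "done"
```

Calling `run_task("write hello", FakeScreen(), demo_plan, demo_act, demo_reflect)` walks through both stub subtasks in order. The `max_steps` budget is a design choice for the sketch: without it, a reflection step that never reports success would loop forever.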
