niuzaisheng/ScreenAgent
A Visual Language Model agent that autonomously controls computers by observing screenshots and executing mouse/keyboard actions.

ScreenAgent is a framework that lets Visual Language Models interact with real computer screens. The agent observes screenshots, breaks the task into subtasks, executes mouse and keyboard operations, and reflects on the results. These stages run as a continuous planning-execution-reflection loop that guides the agent through multi-step desktop tasks. The project also includes a dataset of screenshots and action sequences for training and evaluation.
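
The planning-execution-reflection loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `run_loop`, `Action`, `Verdict`, the three verdict values, and the four callables (`observe`, `plan`, `act`, `reflect`) are all hypothetical names chosen for this example. In a real setup the callables would wrap a screenshot source, the model, and a mouse/keyboard controller.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of the reflection stage (assumed for this sketch)."""
    DONE = "done"      # subtask succeeded, move to the next one
    RETRY = "retry"    # repeat the same subtask
    REPLAN = "replan"  # discard the current plan and plan again


@dataclass
class Action:
    """A single mouse or keyboard operation (illustrative shape only)."""
    kind: str                                   # e.g. "click" or "type"
    payload: dict = field(default_factory=dict)


def run_loop(task, observe, plan, act, reflect, max_steps=20):
    """Run plan -> execute -> reflect until every subtask is done.

    observe()                -> a screenshot (any object)
    plan(task, screenshot)   -> list of subtask descriptions
    act(subtask, screenshot) -> list of Action objects that were executed
    reflect(subtask, shot)   -> a Verdict
    Returns (success, history of executed actions).
    """
    history = []
    subtasks = plan(task, observe())            # planning stage
    index = 0
    for _ in range(max_steps):
        if index >= len(subtasks):
            return True, history                # all subtasks completed
        actions = act(subtasks[index], observe())   # execution stage
        history.extend(actions)
        verdict = reflect(subtasks[index], observe())  # reflection stage
        if verdict is Verdict.DONE:
            index += 1
        elif verdict is Verdict.REPLAN:
            subtasks = plan(task, observe())
            index = 0
        # Verdict.RETRY: fall through and repeat the same subtask
    return index >= len(subtasks), history
```

The `max_steps` bound is a design choice of this sketch: it keeps the agent from looping forever when reflection keeps returning a retry verdict for a subtask it cannot complete.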