xlang-ai/OSWorld
OSWorld is a benchmark suite for evaluating multimodal AI agents on open-ended computer tasks in real environments.

OSWorld provides a standardized evaluation framework for measuring how well AI agents (LLMs, VLMs, large action models) complete tasks in real operating system environments. It supports benchmarking across CLI, GUI, and web interactions, covering domains such as coding, file management, and application control. The benchmark includes verified task instances and evaluation infrastructure, and it supports parallelized evaluation through AWS integration.
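
To make the evaluation idea concrete, here is a minimal, self-contained sketch of a task-level scoring loop: an agent acts on a task, a checker inspects the final state, and success rates are aggregated per domain. This is not OSWorld's API. Every name in it (`Task`, `run_task`, `success_rate_by_domain`, `toy_agent`) is hypothetical, and a plain dictionary stands in for the real operating system state.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy types: a dict stands in for OS state (path -> contents),
# and an action is simply "write these contents to this path".
State = Dict[str, str]
Action = Tuple[str, str]
Agent = Callable[[str, State], Optional[Action]]


@dataclass
class Task:
    task_id: str
    domain: str
    instruction: str
    initial_state: State
    check: Callable[[State], bool]  # inspects the final state, returns pass/fail


def run_task(task: Task, agent: Agent, max_steps: int = 5) -> bool:
    """Let the agent act for up to max_steps, then check the final state."""
    state = dict(task.initial_state)  # copy so tasks stay independent
    for _ in range(max_steps):
        action = agent(task.instruction, state)
        if action is None:  # agent signals that it is finished
            break
        path, contents = action
        state[path] = contents
    return task.check(state)


def success_rate_by_domain(tasks: List[Task], agent: Agent) -> Dict[str, float]:
    """Aggregate pass/fail results into a success rate per domain."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task in tasks:
        total[task.domain] += 1
        passed[task.domain] += int(run_task(task, agent))
    return {domain: passed[domain] / total[domain] for domain in total}


def toy_agent(instruction: str, state: State) -> Optional[Action]:
    # Deliberately limited agent: it only knows how to create notes.txt.
    if "notes.txt" in instruction and "notes.txt" not in state:
        return ("notes.txt", "")
    return None


TASKS = [
    Task("fm-1", "file_management", "Create an empty file named notes.txt",
         {}, lambda s: "notes.txt" in s),
    Task("code-1", "coding", "Write main.py that prints hello",
         {}, lambda s: "print" in s.get("main.py", "")),
]

rates = success_rate_by_domain(TASKS, toy_agent)
```

Scoring on the final state rather than on the agent's action sequence means any route to a correct outcome counts as a success, which suits open-ended tasks. In a real setup, the dictionary would be replaced by a live environment and the checker by a task-specific verification script.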