
web-arena-x/visualwebarena

A realistic benchmark for evaluating multimodal autonomous language agents on diverse web-based visual tasks.

477 stars · Python · Agents · LLMOps · Eval
Velocity (7d): +0.6 ★/day · Trend: steady

VisualWebArena is a benchmark framework for evaluating multimodal agents on diverse, complex web-based visual tasks. It builds on WebArena and keeps its reproducible, execution-based evaluation approach. The benchmark includes agent trajectories (such as GPT-4V with Set-of-Marks) and provides tools for end-to-end training and environment reset, so multimodal agents can be assessed systematically.
