web-arena-x/visualwebarena
A realistic benchmark for evaluating multimodal autonomous language agents on diverse web-based visual tasks.

Velocity (7d): +0.6 ★ / day · Trend: steady
VisualWebArena is a benchmark for evaluating multimodal agents on diverse, complex web-based visual tasks. It builds on WebArena and uses a reproducible, execution-based evaluation approach. The benchmark includes agent trajectories (such as GPT-4V with Set-of-Marks) and provides tools for end-to-end training and environment reset, so multimodal agent capabilities can be assessed systematically.