THUDM/AgentBench
A benchmark suite for evaluating large language models as autonomous agents across diverse real-world tasks.

Velocity (7d): +3.3 ★/day · Trend: steady
AgentBench provides a framework for assessing LLM performance in agentic scenarios, including OS interaction, database querying, knowledge graph reasoning, and web shopping. It supports function-calling-style evaluation with fully containerized deployment for standardized benchmarking. The project integrates with AgentRL to offer end-to-end multi-task, multi-turn reinforcement learning for LLM agents.