THUDM/AgentBench

A benchmark suite for evaluating large language models as autonomous agents across diverse real-world tasks.

3.5k stars Python LLMOps · EvalAgents
Velocity (7d): +3.3 ★/day · Trend: steady

AgentBench provides a framework for assessing LLM performance in agentic scenarios, including OS interaction, database querying, knowledge graph reasoning, and web shopping. It supports function-calling-style evaluation with fully containerized deployment for standardized benchmarking. The project integrates with AgentRL to offer end-to-end multi-task, multi-turn reinforcement learning for LLM agents.
