suyoumo/ClawProBench

Live-first benchmark harness for evaluating LLM agents with deterministic grading and repeated-trial reliability.

697 stars · Python · LLMOps · EvalAgents
Velocity (7d): +1.5 ★/day · Trend: steady

ClawProBench is an evaluation framework designed to benchmark LLM agents within the OpenClaw runtime environment. It provides structured scenario catalogs with 102 active and 162 total scenarios across core, intelligence, coverage, native, and full profiles. The system emphasizes deterministic grading and repeated-trial reliability, generates structured reports, and maintains a public leaderboard for model comparison.
