Is ClawProBench open source?

Yes. suyoumo/ClawProBench is open source, released under the Apache-2.0 license.

What language is ClawProBench written in?

suyoumo/ClawProBench is primarily written in Python.

How popular is ClawProBench?

suyoumo/ClawProBench has 817 stars on GitHub.

Where can I find ClawProBench?

suyoumo/ClawProBench is on GitHub at https://github.com/suyoumo/ClawProBench.


suyoumo/ClawProBench

Grading agents with real execution instead of multiple-choice vibes

ClawProBench exists to score LLM agents by running them live inside the OpenClaw runtime, because static answer keys don't catch real-world failure modes.

★817 stars Python LLMOps · Eval Agents



What it does

ClawProBench is a harness that runs LLM agents through real tasks inside a local OpenClaw runtime, grades the results deterministically, and produces structured reports and leaderboard entries. It ships with 102 active scenarios across six domains, supports multi-trial runs, and can resume interrupted benchmarks rather than starting from scratch. The default ranking uses a 26-scenario core profile, though you can opt into broader slices like intelligence or native.

The interesting bit

The scoring formula deliberately rewards stable, repeated success: FinalScore combines average score, all-trial pass rate, and best-of-three upside, so a model that flukes once but fails twice won't top the board. The authors also maintain a closed-dataset leaderboard alongside the open one, openly admitting that public benchmarks invite vendor optimization.
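
To make the idea of a stability-weighted score concrete, here is a minimal sketch in Python. It is not ClawProBench's actual formula: the weights, the pass threshold, and the names `WEIGHTS`, `PASS_THRESHOLD`, and `final_score` are all assumptions chosen for illustration. Only the three ingredients (average score, all-trial pass, best-of-three) come from the description above.

```python
from statistics import mean

# Hypothetical weights: the real FinalScore coefficients are not stated here.
WEIGHTS = {"avg": 0.5, "all_pass": 0.3, "best_of_3": 0.2}
PASS_THRESHOLD = 1.0  # assumption: a trial passes only with a full score


def final_score(trials_by_scenario):
    """Blend three per-scenario signals, then average across scenarios."""
    per_scenario = []
    for scores in trials_by_scenario.values():
        avg = mean(scores)  # average score over all trials
        # 1.0 only if every trial passed, which rewards consistency
        all_pass = 1.0 if all(s >= PASS_THRESHOLD for s in scores) else 0.0
        best_of_3 = max(scores[:3])  # upside: best of the first three trials
        per_scenario.append(
            WEIGHTS["avg"] * avg
            + WEIGHTS["all_pass"] * all_pass
            + WEIGHTS["best_of_3"] * best_of_3
        )
    return mean(per_scenario)


steady = {"scenario_a": [1.0, 1.0, 1.0]}  # passes every trial
fluke = {"scenario_a": [1.0, 0.0, 0.0]}   # passes once, fails twice
print(final_score(steady) > final_score(fluke))
```

Under these assumed weights the one-time fluke still gets full best-of-three credit, but it loses the all-trial term and most of the average, so it ranks well below the steady model.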

Key highlights

Live-first execution against a real OpenClaw runtime, not mocked APIs or static Q&A.
102 active scenarios (162 in catalog) with profile-based filtering: core, intelligence, coverage, native, and full.
Deterministic grading via scenario-specific custom checkers and multi-trial support.
Resume and rerun semantics for long-running evaluations, plus cost and latency tracking in reports.
Public leaderboard with 65+ evaluated models and a separate closed-dataset release to mitigate gaming.
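
The resume behaviour mentioned above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the project's implementation: the JSON checkpoint format, the `scenario#trial` key scheme, and the names `run_with_resume` and `run_trial` are all hypothetical.

```python
import json
from pathlib import Path


def run_with_resume(scenarios, trials, run_trial, checkpoint: Path):
    """Run every (scenario, trial) pair once, skipping pairs already recorded."""
    # Load results from an earlier, interrupted run if a checkpoint exists.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name in scenarios:
        for t in range(trials):
            key = f"{name}#{t}"  # hypothetical key scheme
            if key in done:
                continue  # already finished; do not re-run
            done[key] = run_trial(name, t)
            # Persist after each trial so a crash loses at most one result.
            checkpoint.write_text(json.dumps(done))
    return done
```

Writing the checkpoint after every trial trades some I/O for the guarantee that a long multi-trial run never has to start from scratch.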

Caveats

Requires a working local OpenClaw runtime and valid auth/config; it is not a self-contained mock environment.
The authors note that because the benchmark is fully open source, vendors can optimize specifically for the public scenarios, which is why a closed-dataset leaderboard was added.
Some tasks are adapted and reworked from earlier public benchmark sets rather than built from scratch.

Verdict

Worth a look if you are building or evaluating OpenClaw-based agents and need reproducible, execution-backed scores. Skip it if you just want a lightweight, framework-agnostic LLM benchmark that runs without a local runtime.
