benchflow-ai/skillsbench

A benchmark that checks if agents can use skills, not just own them

SkillsBench is the first benchmark designed to measure how effectively AI agents leverage modular skill folders—bundles of instructions, scripts, and resources—to execute specialized workflows.

★ 1.6k stars · PDDL · LLMOps · Eval Agents

Velocity (7d): +7.3 ★ / day · Trend: ↘ cooling

What it does

SkillsBench evaluates AI agents through gym-style benchmarking, scoring how well they compose and execute modular skills to complete tasks. Each skill is essentially a folder containing instructions, scripts, and resources; the benchmark tests whether the agent can put them to use rather than simply possessing them. The project ships with a set of default runnable tasks and a framework for creating new ones in the Harbor task format.
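
To make the "skill as a folder" idea concrete, here is a minimal sketch in Python that walks a folder and sorts its files into instructions, scripts, and resources. The `SkillFolder` class, the `load_skill` function, and the suffix rules are all hypothetical and chosen only for illustration; they are not SkillsBench's or Harbor's actual schema.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical suffix conventions, chosen only for illustration.
INSTRUCTION_SUFFIXES = {".md", ".txt"}
SCRIPT_SUFFIXES = {".py", ".sh"}


@dataclass
class SkillFolder:
    """Illustrative view of one skill: a named bundle of three file groups."""
    name: str
    instructions: list[str] = field(default_factory=list)
    scripts: list[str] = field(default_factory=list)
    resources: list[str] = field(default_factory=list)


def load_skill(folder: Path) -> SkillFolder:
    """Classify every file under `folder` by suffix (an assumed convention)."""
    skill = SkillFolder(name=folder.name)
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(folder).as_posix()
        if path.suffix in INSTRUCTION_SUFFIXES:
            skill.instructions.append(rel)
        elif path.suffix in SCRIPT_SUFFIXES:
            skill.scripts.append(rel)
        else:
            skill.resources.append(rel)
    return skill


if __name__ == "__main__":
    # Build a throwaway skill folder so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "example-skill"
        (root / "scripts").mkdir(parents=True)
        (root / "README.md").write_text("How to use this skill.\n")
        (root / "scripts" / "run.py").write_text("print('hello')\n")
        (root / "data.csv").write_text("a,b\n1,2\n")
        print(load_skill(root))
```

Having a folder like this available is only the precondition; the benchmark's question is whether the agent actually reads the instructions and runs the scripts when a task calls for them.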

The interesting bit

The tasks are intentionally designed to require composing two or more skills, with the explicit goal that even top-tier models should score below 50%, a rarity in an era where benchmarks often saturate within months. It also evaluates the skill itself and the agent’s behavior as separate layers, distinguishing tool quality from operator competence.
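
One plausible way to treat skill quality and agent competence as separate variables is to run the same tasks with and without skills and compare pass rates. The Python sketch below does that. The `Run` record, the function names, and the sample results are invented for illustration; this is an assumed comparison design, not SkillsBench's documented scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One attempt at a task (hypothetical record shape)."""
    task: str
    with_skills: bool
    passed: bool


def pass_rate(runs: list[Run], with_skills: bool) -> float:
    """Fraction of passed runs in the chosen condition; 0.0 if none exist."""
    selected = [r for r in runs if r.with_skills == with_skills]
    if not selected:
        return 0.0
    return sum(r.passed for r in selected) / len(selected)


def skill_uplift(runs: list[Run]) -> float:
    """Pass-rate difference attributable to having skills available."""
    return pass_rate(runs, True) - pass_rate(runs, False)


# Made-up results for four tasks, each run once per condition.
SAMPLE_RUNS = [
    Run("task-a", True, True),
    Run("task-b", True, True),
    Run("task-c", True, False),
    Run("task-d", True, False),
    Run("task-a", False, True),
    Run("task-b", False, False),
    Run("task-c", False, False),
    Run("task-d", False, False),
]

if __name__ == "__main__":
    print("with skills:   ", pass_rate(SAMPLE_RUNS, True))
    print("without skills:", pass_rate(SAMPLE_RUNS, False))
    print("uplift:        ", skill_uplift(SAMPLE_RUNS))
```

In this framing, the no-skills pass rate reflects what the agent manages on its own, while the uplift reflects what the skills contribute when the agent uses them well.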

Key highlights

Evaluates skill effectiveness and agent behavior as distinct variables
Tasks target multi-skill composition, not single-tool calls
Built on the BenchFlow SDK and uses uv.lock for reproducible experiments
Supports Harbor format, with planned first-party support for PrimeIntellect Verifiers, OpenReward Standard, and Kaggle Benchmarks
Apache 2.0 licensed with an active Discord and weekly sync

Caveats

Five tasks are cordoned off in tasks-extra/ because they are credential-dependent or integration-incompatible
Running agents requires external API keys (e.g., Anthropic, OpenAI)
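
Because runs depend on external API keys, a quick preflight check can save a failed run. The Python sketch below assumes the keys are supplied through the environment variables `ANTHROPIC_API_KEY` and `OPENAI_API_KEY`, the names the official provider SDKs read by default. Which keys SkillsBench actually needs depends on the agent you run, and `missing_keys` is a hypothetical helper, not part of the project.

```python
import os

# Assumed variable names: the defaults used by the provider SDKs.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All expected API keys are set.")
```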

Verdict

Worth a look if you’re building or evaluating agent frameworks and need a benchmark that is harder and more composition-focused than chatbot leaderboards. Skip it if you’re after a plug-and-play, fully offline evaluation suite: API keys and cloud integrations are effectively mandatory.

Frequently asked

What is benchflow-ai/skillsbench?

SkillsBench is the first benchmark designed to measure how effectively AI agents leverage modular skill folders (bundles of instructions, scripts, and resources) to execute specialized workflows.

Is skillsbench open source?

Yes. benchflow-ai/skillsbench is open source, released under the Apache-2.0 license.

What language is skillsbench written in?

benchflow-ai/skillsbench is primarily written in PDDL.

How popular is skillsbench?

benchflow-ai/skillsbench has 1.6k stars on GitHub and is currently cooling off.

Where can I find skillsbench?

benchflow-ai/skillsbench is on GitHub at https://github.com/benchflow-ai/skillsbench.