benchflow-ai/skillsbench

Benchmark framework that evaluates AI agent performance on skill-based tasks using gym-style evaluation.

1.3k stars PDDL LLMOps · EvalAgents
Velocity (7d): +8.1 ★/day · Trend: steady

SkillsBench benchmarks how well AI agents compose and use modular skills: folders containing instructions, scripts, and resources for specialized workflows. It evaluates both skill quality and agent behavior, and targets major models such as Claude, GPT, MiniMax, and GLM. It prioritizes tasks that require composing two or more skills and on which state-of-the-art performance is below 50%. The framework uses PDDL for planning domains and integrates with Hugging Face datasets for distribution.
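
The prioritization rule above (two or more skills, SOTA below 50%) can be read as a simple filter. The sketch below is illustrative only and is not taken from the SkillsBench codebase: the `Task` record, its field names, the `is_prioritized` helper, and the sample tasks are all hypothetical, and it assumes SOTA performance is expressed as a pass rate between 0.0 and 1.0.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the SkillsBench schema."""
    name: str
    required_skills: Tuple[str, ...]
    sota_score: float  # assumed: best known pass rate, 0.0 to 1.0


def is_prioritized(task: Task, min_skills: int = 2, sota_ceiling: float = 0.5) -> bool:
    """True if the task composes at least `min_skills` distinct skills
    and the best known score is strictly below `sota_ceiling`."""
    distinct_skills = set(task.required_skills)
    return len(distinct_skills) >= min_skills and task.sota_score < sota_ceiling


# Made-up sample tasks, for illustration only.
tasks = [
    Task("single-skill-easy", ("pdf-extract",), 0.90),
    Task("two-skills-hard", ("pdf-extract", "spreadsheet"), 0.35),
    Task("two-skills-solved", ("pdf-extract", "spreadsheet"), 0.80),
    Task("one-skill-hard", ("spreadsheet",), 0.20),
]

prioritized = [t.name for t in tasks if is_prioritized(t)]
print(prioritized)
```

Both conditions must hold together: a hard task that needs only one skill, or a multi-skill task that current models already solve, would not be prioritized under this reading.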
