benchflow-ai/skillsbench
Benchmark framework that evaluates AI agent performance on skill-based tasks using gym-style evaluation.

SkillsBench benchmarks how well AI agents compose and use modular skills: folders containing instructions, scripts, and resources for specialized workflows. It evaluates both skill quality and agent behavior, targeting major models such as Claude, GPT, MiniMax, and GLM. It prioritizes tasks that require composing two or more skills and on which state-of-the-art performance is below 50%. The framework uses PDDL for planning domains and integrates with Hugging Face datasets for distribution.
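
As a rough illustration of the task-prioritization rule described above, the check could be sketched as follows. The `Task` record, its field names, the `is_prioritized` helper, and the sample tasks are all hypothetical and are not taken from the SkillsBench codebase; the sketch also assumes SOTA performance is expressed as a pass rate between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are illustrative, not SkillsBench's schema."""
    name: str
    required_skills: tuple[str, ...]
    sota_pass_rate: float  # assumed: best known agent pass rate, in [0.0, 1.0]


def is_prioritized(task: Task, min_skills: int = 2, sota_ceiling: float = 0.5) -> bool:
    """Return True when the task composes enough distinct skills and SOTA is below the ceiling."""
    enough_skills = len(set(task.required_skills)) >= min_skills
    still_hard = task.sota_pass_rate < sota_ceiling
    return enough_skills and still_hard


# Made-up sample tasks, for illustration only.
tasks = [
    Task("pdf-to-chart", ("pdf-extract", "plotting"), 0.35),
    Task("single-skill-lookup", ("web-search",), 0.20),
    Task("easy-composition", ("csv-clean", "plotting"), 0.80),
]

prioritized = [t.name for t in tasks if is_prioritized(t)]
print(prioritized)  # ['pdf-to-chart']
```

Only the first sample task passes both conditions: the second uses a single skill, and the third is already solved well above the 50% threshold.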