mgechev/skillgrade
A testing framework that runs unit-test-style evaluations of AI agent skills against configurable rubrics and task definitions.

Skillgrade evaluates AI agent skills by running the tasks defined in eval.yaml and grading each agent response. It supports multiple agent backends (Claude, Codex, Gemini, OpenCode) and two grader types: deterministic (exact match) and llm_rubric (LLM-based evaluation against a rubric). Users scaffold eval configs manually or with AI assistance, choose how many trials to run through a preset (smoke, reliable, or regression), and preview results in the CLI or a browser UI.
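
The idea of grading repeated trials with a deterministic grader can be sketched as follows. This is a minimal conceptual illustration, not Skillgrade's implementation or API: the names `exact_match_grader`, `run_trials`, `pass_rate`, and `TrialResult` are hypothetical, the agent is a stub function rather than a real backend, and the trial count is passed explicitly because the sizes of the smoke, reliable, and regression presets are not stated here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrialResult:
    """Outcome of one trial (hypothetical structure, for illustration only)."""
    response: str
    passed: bool


def exact_match_grader(expected: str) -> Callable[[str], bool]:
    """Build a deterministic grader: pass only if the response equals `expected`
    after surrounding whitespace is stripped."""
    def grade(response: str) -> bool:
        return response.strip() == expected.strip()
    return grade


def run_trials(
    agent: Callable[[str], str],
    task: str,
    grader: Callable[[str], bool],
    trials: int,
) -> List[TrialResult]:
    """Run the same task `trials` times and grade every response."""
    results = []
    for _ in range(trials):
        response = agent(task)
        results.append(TrialResult(response=response, passed=grader(response)))
    return results


def pass_rate(results: List[TrialResult]) -> float:
    """Fraction of trials that passed; 0.0 when there are no trials."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


if __name__ == "__main__":
    # Stub agent that answers inconsistently, standing in for a real backend.
    canned = iter(["4", "4 ", "five", "4"])

    def stub_agent(task: str) -> str:
        return next(canned)

    results = run_trials(stub_agent, "What is 2 + 2?", exact_match_grader("4"), trials=4)
    print(f"pass rate: {pass_rate(results):.2f}")  # prints "pass rate: 0.75"
```

Running a task several times matters because agent output is not deterministic: a single passing trial says little, while a pass rate over many trials indicates how reliable the skill is. An LLM-based rubric grader would fit the same `grader` slot by replacing the equality check with a model call that scores the response.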