
harbor-framework/harbor

An agent evaluation framework for benchmarking and optimizing AI agents and language models across RL environments.

2.3k stars · Python · Agents · LLMOps · Eval
Velocity (7d): +7.6 ★/day · Trend: steady

Harbor provides a unified harness for running agent evaluations across multiple benchmarks, including Terminal-Bench, SWE-Bench, and Aider Polyglot. It supports a range of AI coding agents and can distribute execution across cloud providers such as Daytona and Modal. The framework can also generate rollouts for reinforcement learning optimization and includes tools for benchmarking agents against established evaluation datasets.
