harbor-framework/harbor
An agent evaluation framework for benchmarking and optimizing AI agents and language models across RL environments.

Velocity (7d): +7.6 ★/day · Trend: steady
Harbor provides a unified harness for running agent evaluations across multiple benchmarks, including Terminal-Bench, SWE-Bench, and Aider Polyglot. It can evaluate various AI coding agents and distribute execution across cloud providers such as Daytona and Modal. The framework also generates rollouts for reinforcement learning optimization and includes tools for benchmarking agents against established evaluation datasets.