harbor-framework/terminal-bench
A benchmark suite for evaluating how well AI agents perform real-world terminal tasks like compiling code and training models.

Velocity (7d): +4.6 ★/day · Trend: steady
Terminal-Bench is an evaluation framework for testing LLM agents in realistic terminal environments. It provides reproducible task suites covering end-to-end challenges such as compiling code, training machine learning models, and setting up servers. The benchmark measures agent performance on system-level reasoning and autonomous task completion across multi-step scenarios.