
harbor-framework/terminal-bench

A benchmark suite for evaluating how well AI agents perform real-world terminal tasks like compiling code and training models.

2.3k stars · Python · LLMOps · EvalAgents
Velocity (7d): +4.6 ★/day · Trend: steady

Terminal-Bench is an evaluation framework for testing LLM agents in realistic terminal environments. It provides reproducible task suites covering end-to-end challenges such as compiling code, training machine learning models, and setting up servers. The benchmark measures agent performance on system-level reasoning and autonomous task completion across multi-step scenarios.
