
openai/mle-bench

A benchmark suite from OpenAI for measuring AI agent performance on machine learning engineering challenges.

1.6k stars · Python · LLMOps · EvalAgents
Velocity (7d): +2.6 ★/day · Trend: steady

MLE-bench evaluates how well AI agents perform at machine learning engineering tasks by running them through a set of standardized ML competitions. The repository includes the dataset construction code, evaluation logic, and baseline agent implementations. The benchmark measures agent capabilities across different difficulty levels (Low/Medium/High) and tracks performance metrics like accuracy and running time.
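
As a rough illustration of reporting results per difficulty level, the sketch below groups a handful of made-up competition records by level and averages accuracy and running time within each group. The record layout, field names (`difficulty`, `accuracy`, `runtime_s`), competition names, and the `summarize_by_difficulty` helper are all hypothetical and are not taken from the mle-bench code; the repository's own grading and reporting logic may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-competition results; names and values are illustrative only
# and do not come from the mle-bench repository.
results = [
    {"competition": "comp-a", "difficulty": "Low", "accuracy": 0.90, "runtime_s": 120.0},
    {"competition": "comp-b", "difficulty": "Low", "accuracy": 0.80, "runtime_s": 180.0},
    {"competition": "comp-c", "difficulty": "Medium", "accuracy": 0.70, "runtime_s": 600.0},
    {"competition": "comp-d", "difficulty": "High", "accuracy": 0.50, "runtime_s": 1800.0},
]


def summarize_by_difficulty(records):
    """Group records by difficulty level and average each metric per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["difficulty"]].append(record)
    return {
        level: {
            "n": len(rows),
            "mean_accuracy": mean(r["accuracy"] for r in rows),
            "mean_runtime_s": mean(r["runtime_s"] for r in rows),
        }
        for level, rows in groups.items()
    }


summary = summarize_by_difficulty(results)
for level in ("Low", "Medium", "High"):
    print(level, summary[level])
```

Keeping the levels separate avoids letting a large number of easy tasks mask weak performance on the hard ones.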
