openai/mle-bench
A benchmark suite from OpenAI for measuring AI agent performance on machine learning engineering challenges.

Velocity (7d): +2.6 ★/day · Trend: steady
MLE-bench evaluates how well AI agents perform machine learning engineering tasks by running them through a set of standardized ML competitions. The repository includes the dataset construction code, the evaluation logic, and baseline agent implementations. Competitions are grouped by difficulty level (Low/Medium/High), and the benchmark tracks performance metrics such as accuracy and running time.
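To make the per-difficulty reporting concrete, here is a minimal sketch of how results could be grouped by difficulty level and averaged. It is not taken from the mle-bench codebase: the `RunResult` record, its field names, and `summarize_by_difficulty` are hypothetical, and the sample values are made up for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    # Hypothetical record; these field names are assumptions, not mle-bench's schema.
    competition: str
    difficulty: str   # "Low", "Medium", or "High"
    score: float      # e.g. accuracy on the competition's held-out set
    runtime_s: float  # wall-clock running time in seconds


def summarize_by_difficulty(results):
    """Group runs by difficulty level and average score and runtime per group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.difficulty].append(r)
    return {
        level: {
            "n": len(runs),
            "mean_score": mean(r.score for r in runs),
            "mean_runtime_s": mean(r.runtime_s for r in runs),
        }
        for level, runs in groups.items()
    }


# Illustrative (made-up) results for three competitions.
results = [
    RunResult("comp-a", "Low", 0.5, 100.0),
    RunResult("comp-b", "Low", 1.0, 300.0),
    RunResult("comp-c", "High", 0.25, 900.0),
]
summary = summarize_by_difficulty(results)
```

Reporting each difficulty level separately keeps a strong showing on easy competitions from masking weak results on hard ones.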