openai/frontier-evals

OpenAI's open-source framework for evaluating frontier AI model capabilities using structured benchmarks.

1.2k stars Python LLMOps · Eval
Velocity (7d): +2.8 ★ / day · Trend: steady

Frontier Evals provides reproducible evaluation suites for assessing state-of-the-art AI models on complex tasks. It includes PaperBench for replicating AI research papers, SWE-Lancer for real software engineering freelance tasks, and EVMBench for smart contract security testing. Each benchmark runs models end-to-end against verifiable ground-truth outcomes and uses uv for environment management.
