bigcode-project/bigcodebench
A benchmark, with a public leaderboard, for evaluating large language models on code generation tasks.

Velocity (7d): +0.7 ★/day · Trend: steady
BigCodeBench is a code generation benchmark, published at ICLR'25, that evaluates how well LLMs synthesize programs involving tool use and function calling. It provides a standardized evaluation harness, a public leaderboard on HuggingFace, and Docker-based evaluation infrastructure, and it benchmarks LLMs across multiple dimensions, including instruction following, code completion, and agentic tool-use scenarios.
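
To make the idea of an evaluation harness concrete, here is a minimal sketch of how a generated solution can be checked against unit tests. It is not BigCodeBench's actual implementation or API: the names `run_candidate`, `pass_rate`, `task_func`, and `TestCases` are assumptions chosen for illustration, and the sketch runs code in-process with `exec`, whereas a real harness would isolate untrusted code (for example inside Docker).

```python
import unittest


def run_candidate(solution_src: str, test_src: str) -> bool:
    """Return True if the candidate passes every test in test_src.

    Assumption: test_src defines a unittest.TestCase subclass named
    `TestCases` (hypothetical naming for this sketch).
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # defines the candidate function(s)
        exec(test_src, namespace)      # defines the TestCases class
    except Exception:
        return False                   # code that fails to load counts as a fail
    test_class = namespace.get("TestCases")
    if test_class is None:
        return False
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    result = unittest.TestResult()
    suite.run(result)
    # Require at least one executed test so an empty suite is not a pass.
    return result.testsRun > 0 and result.wasSuccessful()


def pass_rate(candidates: list[str], test_src: str) -> float:
    """Fraction of candidate solutions that pass all tests."""
    if not candidates:
        return 0.0
    passed = sum(run_candidate(src, test_src) for src in candidates)
    return passed / len(candidates)


# Toy task: return the input list in ascending order.
GOOD = "def task_func(xs):\n    return sorted(xs)\n"
BAD = "def task_func(xs):\n    return xs\n"
TESTS = (
    "import unittest\n"
    "class TestCases(unittest.TestCase):\n"
    "    def test_sorted(self):\n"
    "        self.assertEqual(task_func([3, 1, 2]), [1, 2, 3])\n"
)

if __name__ == "__main__":
    print(pass_rate([GOOD, BAD], TESTS))
```

Scoring by executed tests rather than by text similarity is what lets a harness like this accept any functionally correct solution, regardless of how it is written.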