
bigcode-project/bigcodebench

A benchmark for evaluating large language models on code generation tasks with a public leaderboard.

503 stars · Python · LLMOps · Eval
Velocity (7d): +0.7 ★/day · Trend: steady

BigCodeBench is a code generation benchmark, published at ICLR 2025, that evaluates how well LLMs synthesize programs involving tool use and function calling. It provides a standardized evaluation harness, a public leaderboard on Hugging Face, and Docker-based evaluation infrastructure. It benchmarks LLMs across multiple dimensions, including instruction following, code completion, and agentic tool-use scenarios.
