
bird-bench/BIRD-CRITIC-1

A benchmark suite evaluating LLMs on real-world SQL user issue resolution tasks.

1.1k stars Python LLMOps · EvalData Tooling
Velocity (7d): +2.2 ★ / day · Trend: steady

BIRD-CRITIC-1.0 is a NeurIPS 2025 benchmark for evaluating large language models on software engineering tasks involving SQL. It focuses on realistic database application issues across SQLite and other engines, and provides datasets of user SQL problems, ground-truth solutions, and test cases. The repository includes evaluation code and a leaderboard, and it uses Hugging Face for dataset hosting.
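
To make the "user problem, ground-truth solution, test case" structure concrete, here is a minimal sketch of how one such task could be checked against SQLite. It is not the repository's evaluation harness: the function names, the toy `orders` schema, and the pass criterion (row sets equal, ignoring order) are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical toy task; the schema and queries are invented for illustration.
SETUP_SQL = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders (customer, amount) VALUES ('ann', 10.0), ('ann', 5.0), ('bob', 7.5);
"""
# The user's problematic query: selects a bare column instead of aggregating.
ISSUE_SQL = "SELECT customer, amount FROM orders GROUP BY customer"
# The ground-truth solution for the task.
REFERENCE_SQL = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
# A model-proposed fix to be evaluated.
CANDIDATE_SQL = (
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer"
)


def run_query(conn, sql):
    """Run a query and return its rows in a canonical order."""
    # Sorting by repr makes the comparison order-insensitive (an assumed criterion).
    return sorted(conn.execute(sql).fetchall(), key=repr)


def passes_test_case(setup_sql, candidate_sql, reference_sql):
    """Return True if the candidate yields the same rows as the reference."""
    conn = sqlite3.connect(":memory:")  # fresh database for every check
    try:
        conn.executescript(setup_sql)
        try:
            candidate_rows = run_query(conn, candidate_sql)
        except sqlite3.Error:
            return False  # a candidate that fails to execute does not pass
        return candidate_rows == run_query(conn, reference_sql)
    finally:
        conn.close()
```

Calling `passes_test_case(SETUP_SQL, CANDIDATE_SQL, REFERENCE_SQL)` checks the proposed fix, and passing `ISSUE_SQL` in its place checks the original query. A real harness would likely also need per-engine connections and test cases that go beyond comparing result sets.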
