bird-bench/BIRD-CRITIC-1
A benchmark suite evaluating LLMs on real-world SQL user issue resolution tasks.

Velocity (7d): +2.2 ★/day · Trend: steady
BIRD-CRITIC-1.0 is a NeurIPS 2025 benchmark for evaluating large language models on software engineering tasks involving SQL. It focuses on realistic database application issues across SQLite and other engines, and provides datasets of user SQL problems, ground-truth solutions, and test cases. The repository includes evaluation code and a leaderboard, and the datasets are hosted on Hugging Face.
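To make the "problem, ground-truth solution, test case" structure concrete, here is a minimal sketch of how a candidate fix might be checked against a reference query. It is not the repository's evaluation code: the function name `passes_test_case`, the toy `orders` schema, and the result-equality criterion are all assumptions chosen for illustration, and it uses only Python's standard `sqlite3` module.

```python
import sqlite3


def passes_test_case(setup_sql: str, candidate_sql: str, reference_sql: str) -> bool:
    """Hypothetical check: True if both queries return the same set of rows.

    Assumption: a test case passes when the candidate's result rows match the
    ground-truth query's rows, ignoring row order.
    """

    def run(query: str):
        # Use a fresh in-memory database per query so runs cannot interfere.
        conn = sqlite3.connect(":memory:")
        try:
            conn.executescript(setup_sql)
            rows = conn.execute(query).fetchall()
            # Sort by repr so rows containing NULLs can still be ordered.
            return sorted(rows, key=repr)
        finally:
            conn.close()

    try:
        return run(candidate_sql) == run(reference_sql)
    except sqlite3.Error:
        # A query that fails to execute counts as a failed test case.
        return False


# Toy example (invented data, not taken from the benchmark).
SETUP = """
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.0), (3, 5.0);
"""
REFERENCE = "SELECT id FROM orders WHERE amount >= 10 ORDER BY id"
USER_ISSUE = "SELECT id FROM orders WHERE amount > 10 ORDER BY id"  # off-by-one bug
MODEL_FIX = "SELECT id FROM orders WHERE amount >= 10"

if __name__ == "__main__":
    print("issue query passes:", passes_test_case(SETUP, USER_ISSUE, REFERENCE))
    print("model fix passes:", passes_test_case(SETUP, MODEL_FIX, REFERENCE))
```

Comparing result sets is only one possible criterion; a real benchmark harness may also check side effects, performance, or dialect-specific behavior.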