Text-to-SQL datasets that admit they aren't perfect
A curated, corrected collection of natural-language-to-SQL benchmarks with annotated variables and schemas across ten domains.

What it does
This repository bundles and cleans up datasets for training and evaluating systems that convert plain English questions into SQL queries. For each of ten domains (Academic, Advising, ATIS, Geography, Restaurants, Scholar, Spider, IMDB, Yelp, and WikiSQL), it provides the natural-language questions, the corresponding SQL, the database schemas, and the actual databases. The authors also include baseline systems and evaluation tools.
The interesting bit
The authors explicitly version their data fixes and keep a running list of known issues, which is refreshingly honest for a benchmark repository. They also annotate variables in questions (like “[course name]” or “[instructor]”) rather than leaving you to guess which tokens map to database values.
Key highlights
- Ten domains ranging from flight booking (ATIS) to movies (IMDB) and business reviews (Yelp) to the large cross-domain Spider set
- Versioned releases: v4 at the time of writing, with a documented changelog of what got fixed each time
- Includes data from prior benchmarks but with corrections for mislabeled variables and other bugs
- Provides citation templates for every original source so you don’t accidentally snub prior work
- Code for systems and evaluation tools included in separate directories
Caveats
- The authors warn that none of the datasets is perfect. Fixes accumulate on a development branch and merge to master only infrequently, so master can lag behind the latest corrections
- Some domains are quite small (original ATIS and Geography date to the 1990s)
- You’ll need to handle the variable annotation format yourself; it’s not automatically resolved to database values
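As a rough idea of what "handling the annotation yourself" involves, here is a minimal Python sketch. It assumes placeholders appear in square brackets, as in the "[course name]" example above, and that you already have a mapping from variable names to values. The record layout, the field names and the `fill_variables` helper are hypothetical, invented for illustration; check the repository's own data files for the real format.

```python
import re

# Matches a bracketed placeholder such as "[course name]".
# Assumption: placeholders use square brackets and are never nested.
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_variables(template, values):
    """Replace every [variable] in `template` with its entry in `values`.

    Raises KeyError if the template names a variable with no value,
    so a missing annotation fails loudly instead of silently.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for variable {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical record: these field names are illustrative only,
# not the repository's actual schema.
record = {
    "question": "Who teaches [course name] ?",
    "sql": 'SELECT instructor FROM course WHERE name = "[course name]"',
    "values": {"course name": "Compilers"},
}

question = fill_variables(record["question"], record["values"])
sql = fill_variables(record["sql"], record["values"])

print(question)
print(sql)
```

Plain string substitution is fine for benchmark data you trust. If the values come from users, bind them as query parameters instead of pasting them into the SQL text.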
Verdict
Grab this if you’re building or benchmarking a text-to-SQL model and want pre-cleaned data with honest documentation of its flaws. Skip it if you need a single massive, homogeneous dataset—Spider and WikiSQL are large, but the rest are boutique by modern standards.