jkkummerfeld/text2sql-data

Text-to-SQL datasets that admit they aren't perfect

A curated, corrected collection of natural-language-to-SQL benchmarks with annotated variables and schemas across ten datasets.

587 stars · Python · Data Tooling · Language Models
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

This repository bundles and cleans up datasets for training and evaluating systems that convert plain English questions into SQL queries. For each of ten datasets (Academic, Advising, ATIS, Geography, Restaurants, Scholar, Spider, IMDB, Yelp, and WikiSQL) it provides the natural-language questions, corresponding SQL, database schemas, and actual databases. The authors also include baseline systems and evaluation tools.

The interesting bit

The authors explicitly version their data fixes and keep a running list of known issues, which is refreshingly honest for a benchmark repository. They also annotate variables in questions (like “[course name]” or “[instructor]”) rather than leaving you to guess which tokens map to database values.

Key highlights

  • Ten datasets ranging from flight booking (ATIS) to movies and business reviews (IMDB, Yelp) to the large cross-domain Spider set
  • Versioned releases: currently at v4, with documented changelog of what got fixed each time
  • Includes data from prior benchmarks but with corrections for mislabeled variables and other bugs
  • Provides citation templates for every original source so you don’t accidentally snub prior work
  • Code for systems and evaluation tools included in separate directories

Caveats

  • The authors warn none of the datasets are perfect; fixes accumulate on a development branch and only merge to master infrequently
  • Some datasets are quite small, and the oldest (the original ATIS and Geography) date to the 1990s
  • You’ll need to handle the variable annotation format yourself; it’s not automatically resolved to database values
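
To give a feel for what resolving the variable annotations yourself might involve, here is a minimal Python sketch. The record layout, the placeholder name `course_name0`, the table and column names, and the helper `fill_variables` are all hypothetical, chosen for illustration; check the repository's README for the real field names before adapting it.

```python
import re


def fill_variables(template: str, variables: dict[str, str]) -> str:
    """Replace each variable placeholder in `template` with its value."""
    if not variables:
        return template
    # Try longer names first so "city10" is not partially matched as "city1".
    names = sorted(variables, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # A single pass means substituted values are never re-scanned.
    return pattern.sub(lambda match: variables[match.group(0)], template)


# Hypothetical record: the keys and the schema are assumptions for this sketch,
# not the repository's documented format.
record = {
    "question": "Who teaches course_name0 ?",
    "sql": (
        "SELECT i.name FROM instructor AS i "
        "JOIN course AS c ON c.instructor_id = i.id "
        'WHERE c.name = "course_name0" ;'
    ),
    "variables": {"course_name0": "Databases"},
}

question = fill_variables(record["question"], record["variables"])
sql = fill_variables(record["sql"], record["variables"])

print(question)
print(sql)
```

Substituting in one regex pass, with longer names tried first, avoids two easy mistakes: a short placeholder clobbering part of a longer one, and a filled-in value being mistaken for another placeholder.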

Verdict

Grab this if you’re building or benchmarking a text-to-SQL model and want pre-cleaned data with honest documentation of its flaws. Skip it if you need a single massive, homogeneous dataset—Spider and WikiSQL are large, but the rest are boutique by modern standards.
