Is text2sql-data open source?

Yes — jkkummerfeld/text2sql-data is an open-source project tracked on heatdrop.

What language is text2sql-data written in?

jkkummerfeld/text2sql-data is primarily written in Python.

How popular is text2sql-data?

jkkummerfeld/text2sql-data has 587 stars on GitHub.

Where can I find text2sql-data?

jkkummerfeld/text2sql-data is on GitHub at https://github.com/jkkummerfeld/text2sql-data.


jkkummerfeld/text2sql-data

Text-to-SQL datasets that admit they aren't perfect

A curated, corrected collection of natural-language-to-SQL benchmarks with annotated variables and schemas across ten datasets.

★587 stars · Python · Data Tooling · Language Models




What it does

This repository bundles and cleans up datasets for training and evaluating systems that convert plain English questions into SQL queries. It covers ten datasets: Academic, Advising, ATIS, Geography, Restaurants, Scholar, Spider, IMDB, Yelp, and WikiSQL. For each one it provides the natural-language questions, the corresponding SQL, database schemas, and actual databases. The authors also include baseline systems and evaluation tools.

The interesting bit

The authors explicitly version their data fixes and keep a running list of known issues, which is refreshingly honest for a benchmark repository. They also annotate variables in questions (like “[course name]” or “[instructor]”) rather than leaving you to guess which tokens map to database values.

Key highlights

Ten datasets ranging from flight information (ATIS) to movies (IMDB) and business reviews (Yelp) to the large cross-domain Spider set
Versioned releases: currently at v4, with a documented changelog of what got fixed each time
Includes data from prior benchmarks but with corrections for mislabeled variables and other bugs
Provides citation templates for every original source so you don’t accidentally snub prior work
Code for systems and evaluation tools included in separate directories

Caveats

The authors warn that none of the datasets are perfect; fixes accumulate on a development branch and merge to master only infrequently
Some datasets are quite small, and the original ATIS and Geography sets date to the 1990s
You’ll need to handle the variable annotation format yourself; it’s not automatically resolved to database values
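To make the last caveat concrete, here is a minimal Python sketch of resolving annotated variables yourself. The record layout, the field names (`question`, `sql`, `variables`), and the helper `fill_placeholders` are all assumptions made for illustration, built around the bracketed style quoted above (“[course name]”). They are not the repository's actual file format, so check its README before adapting this.

```python
import re

# Hypothetical record: the field names and the bracket syntax are assumptions
# for illustration, not the repository's documented schema.
record = {
    "question": "Who teaches [course name] this semester?",
    "sql": "SELECT instructor FROM offering WHERE course = '[course name]'",
    "variables": {"course name": "Operating Systems"},
}

# Matches one bracketed placeholder such as "[course name]".
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_placeholders(template, values):
    """Replace every [name] in template with values[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly instead of leaving an unresolved placeholder behind.
            raise KeyError(f"no value for placeholder [{name}]")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


question = fill_placeholders(record["question"], record["variables"])
query = fill_placeholders(record["sql"], record["variables"])
print(question)
print(query)
```

Plain string substitution like this is fine for building training pairs. If the filled-in query is going to be executed against a real database, values containing quotes would need escaping or parameter binding.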

Verdict

Grab this if you’re building or benchmarking a text-to-SQL model and want pre-cleaned data with honest documentation of its flaws. Skip it if you need a single massive, homogeneous dataset—Spider and WikiSQL are large, but the rest are boutique by modern standards.

Frequently asked

What is jkkummerfeld/text2sql-data?: A curated, corrected collection of natural-language-to-SQL benchmarks with annotated variables and schemas across ten datasets.