beader/tianchi_nl2sql

Third place in a Chinese NL2SQL contest, explained in notebooks

A readable, medal-winning approach to turning natural language questions into SQL using BERT and a clever two-model decomposition.

559 stars · Jupyter Notebook · Language Models

What it does

This repo documents the 3rd-place solution to the first Chinese NL2SQL challenge, where the task is to convert a natural language question plus a table schema into a structured SQL query. The code is organized as Jupyter notebooks with utility modules, intended for learning rather than exact reproduction of the competition score.

The interesting bit

The team splits the problem into two models: one predicts which columns to select and which conditions to apply (the “where” logic), and the second figures out the actual comparison values by treating candidate combinations as binary classification problems. They also merge SELECT and aggregation into a single prediction by adding a NO_OP class for unselected columns, which keeps the architecture cleaner.
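A minimal sketch of how the NO_OP trick can be decoded, assuming each column gets one probability vector over aggregation classes. The label list, the `decode_select` helper, and the example probabilities are hypothetical and are not taken from the repo; the point is only that one classification per column answers both "is it selected?" and "with which aggregation?".

```python
# Hypothetical label set: real aggregation ops plus NO_OP ("column not selected").
# An empty string stands for "selected without aggregation".
AGG_LABELS = ["", "AVG", "MAX", "MIN", "COUNT", "SUM", "NO_OP"]
NO_OP = AGG_LABELS.index("NO_OP")


def decode_select(column_names, agg_probs):
    """Turn per-column class probabilities into (aggregation, column) pairs.

    agg_probs holds one probability list per column, aligned with AGG_LABELS.
    Columns whose most likely class is NO_OP are left out of the SELECT clause.
    """
    selected = []
    for name, probs in zip(column_names, agg_probs):
        # Index of the highest-probability class for this column.
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != NO_OP:
            selected.append((AGG_LABELS[label], name))
    return selected


# Made-up probabilities for three columns, for illustration only.
columns = ["city", "price", "area"]
probs = [
    [0.05, 0.00, 0.0, 0.0, 0.0, 0.0, 0.95],  # city  -> NO_OP
    [0.10, 0.80, 0.0, 0.0, 0.0, 0.0, 0.10],  # price -> AVG
    [0.02, 0.00, 0.0, 0.0, 0.0, 0.0, 0.98],  # area  -> NO_OP
]
print(decode_select(columns, probs))
```

Without the extra class, a design like this would need a separate "is this column selected?" head plus an aggregation head whose output is only meaningful for selected columns.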

Key highlights

  • Built on keras-bert with the Chinese whole-word-masking BERT variant (Chinese-BERT-wwm)
  • Uses RAdam optimizer from Su Jianlin’s open-source implementation
  • Model 1 injects TEXT/REAL type tokens before each column header to hint schema types to BERT
  • Model 2 enumerates condition candidates and scores them independently, then merges
  • Includes a Docker image and requirements.txt for the exact TensorFlow nightly + Python 3.6 environment used during the competition
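The two model-specific bullets above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `[TEXT]` and `[REAL]` token strings, the character-level split, the helper names, and the 0.5 threshold are hypothetical, and the repo's own tokenization and scoring code may differ.

```python
from itertools import product


def build_model1_tokens(question, headers):
    """Build a Model 1 style input: the question, then each column header
    preceded by a token naming its type.

    headers is a list of (column_name, column_type) pairs, where column_type
    is "text" or "real". Splitting into characters is a simplification.
    """
    tokens = ["[CLS]"] + list(question) + ["[SEP]"]
    for name, col_type in headers:
        type_token = "[TEXT]" if col_type == "text" else "[REAL]"
        tokens += [type_token] + list(name) + ["[SEP]"]
    return tokens


def enumerate_candidates(cond_columns, value_spans):
    """Pair every predicted (column, operator) with every candidate value."""
    return [(col, op, val) for (col, op), val in product(cond_columns, value_spans)]


def select_conditions(candidates, scores, threshold=0.5):
    """Keep candidates whose independent binary score clears the threshold."""
    return [cand for cand, score in zip(candidates, scores) if score >= threshold]


tokens = build_model1_tokens("ab", [("x", "text"), ("y", "real")])
candidates = enumerate_candidates([("price", ">")], ["100", "200"])
# Scores would come from the second model; these are made-up numbers.
kept = select_conditions(candidates, [0.9, 0.2])
print(tokens)
print(kept)
```

Scoring each candidate on its own keeps the second model a plain binary classifier; the cost is that the number of candidates grows with the product of predicted conditions and candidate values.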

Caveats

  • The notebooks were cleaned up for educational purposes and “will not fully reproduce the online results”
  • Requires a specific nightly TensorFlow GPU build; the README suggests Docker to avoid dependency pain

Verdict

Worth studying if you’re building NL2SQL systems or want to see how competition solutions decompose messy structured prediction into tractable BERT fine-tuning steps. Skip if you need a production-ready library: this is reference code, not a packaged tool.
