Google's search logs, annotated and served as QA fuel
A dataset of real user questions paired with Wikipedia answers, built to stress-test reading comprehension on actual human curiosity.

What it does
Natural Questions (NQ) is a Google-produced dataset for training and evaluating question-answering systems. It pairs 307,372 real search queries with entire Wikipedia pages, annotated by humans who identified both long answers (the smallest HTML bounding box containing the answer) and short answers (exact text spans or yes/no judgments).
The interesting bit
The dataset keeps the raw HTML, not just extracted text, so models can in principle exploit document structure such as tables and lists. There’s also a built-in tension between two tasks: finding the right paragraph or table (long answer, human F1 ~87%) and pinpointing the exact words (short answer, human F1 ~76%).
Key highlights
- 307K training examples, 7.8K dev, 7.8K hidden test
- Long answers are HTML elements: 72.9% paragraphs, 19% tables
- Short answers are token spans; 90% are single spans, but some are multi-span lists
- Includes a simplified text-extracted version plus raw HTML with byte-level offsets
- Evaluation accepts predictions as either token offsets or raw byte offsets into the original HTML
- Docker-based submission required for the hidden test set leaderboard
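To make the span-based layout in the highlights concrete, here is a minimal sketch of reading answers out of one record. It rests on assumptions: the field names (document_text, annotations, long_answer, short_answers, yes_no_answer, document_tokens, start_byte, end_byte) follow the public release as I understand it, offsets are treated as start-inclusive and end-exclusive, and the records themselves are toy data invented for illustration. The helpers read_answers and token_span_to_bytes are hypothetical names, not part of any NQ tooling.

```python
# Toy records only: field names follow the NQ release as assumed here,
# and the contents are invented for illustration.

simplified = {
    "question_text": "who wrote on the origin of species",
    "document_text": "<P> On the Origin of Species was written by Charles Darwin . </P>",
    "annotations": [
        {
            "long_answer": {"start_token": 0, "end_token": 13},
            "short_answers": [{"start_token": 9, "end_token": 11}],
            "yes_no_answer": "NONE",
        }
    ],
}


def span_text(tokens, span):
    """Join the tokens covered by a span (start inclusive, end exclusive)."""
    return " ".join(tokens[span["start_token"]:span["end_token"]])


def read_answers(example):
    """Return long/short answer strings for every annotation of one example."""
    # The simplified text is assumed to be whitespace-separated tokens.
    tokens = example["document_text"].split(" ")
    return [
        {
            "long": span_text(tokens, ann["long_answer"]),
            "short": [span_text(tokens, s) for s in ann["short_answers"]],
            "yes_no": ann["yes_no_answer"],
        }
        for ann in example["annotations"]
    ]


# Original-format sketch: each token carries byte offsets into the raw HTML.
html = "<p>Charles Darwin</p>"
document_tokens = [
    {"token": "<p>", "start_byte": 0, "end_byte": 3, "html_token": True},
    {"token": "Charles", "start_byte": 3, "end_byte": 10, "html_token": False},
    {"token": "Darwin", "start_byte": 11, "end_byte": 17, "html_token": False},
    {"token": "</p>", "start_byte": 17, "end_byte": 21, "html_token": True},
]


def token_span_to_bytes(tokens, start_token, end_token):
    """Map an end-exclusive token span to an end-exclusive byte span."""
    return tokens[start_token]["start_byte"], tokens[end_token - 1]["end_byte"]


result = read_answers(simplified)[0]
start_byte, end_byte = token_span_to_bytes(document_tokens, 1, 3)
print(result["short"])
print(html.encode("utf-8")[start_byte:end_byte].decode("utf-8"))
```

The sketch does not handle annotations that lack a long or short answer; real records would need a check for that case before slicing.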
Caveats
- The competition site only provides the original HTML format, not the simplified one
- The top_level flag on long answer candidates is a convenience, not part of the actual task definition
- Baseline systems and TensorFlow dataset code live in a separate repository, not this one
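As a rough illustration of how the top_level flag can be used as a convenience, the sketch below filters a candidate list down to non-nested candidates while keeping each candidate's original index. The candidate fields (start_token, end_token, top_level) are assumed from the release format, the data is invented, and top_level_only is a hypothetical helper name.

```python
# Invented candidates: a table, a row nested inside it, and a paragraph.
candidates = [
    {"start_token": 0, "end_token": 40, "top_level": True},
    {"start_token": 2, "end_token": 12, "top_level": False},
    {"start_token": 41, "end_token": 60, "top_level": True},
]


def top_level_only(cands):
    """Keep (original_index, candidate) pairs whose top_level flag is set."""
    return [(i, c) for i, c in enumerate(cands) if c["top_level"]]


kept = top_level_only(candidates)
kept_indices = [i for i, _ in kept]
print(kept_indices)
```

Keeping the original index matters if predictions or annotations refer to candidates by position in the full list. Since the flag is not part of the task definition, dropping nested candidates this way is a modeling choice that could exclude a correct nested answer.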
Verdict
Worth grabbing if you’re building or benchmarking extractive QA systems and want real user queries rather than synthetic ones. Skip it if you need conversational multi-turn QA or domain-specific corpora: the dataset is strictly single-turn, Wikipedia-grounded search questions.