google-research-datasets/natural-questions

Google's search logs, annotated and served as QA fuel

A dataset of real user questions paired with Wikipedia answers, built to stress-test reading comprehension on actual human curiosity.

1.1k stars · Python · Data Tooling · Language Models
Velocity (7d): +0.4 ★/day · Trend: steady

What it does

Natural Questions (NQ) is a Google-produced dataset for training and evaluating question-answering systems. Its training set pairs 307,373 real search queries with full Wikipedia pages. Human annotators marked a long answer (the smallest HTML bounding box that contains the answer) and, where one exists, a short answer (one or more exact text spans, or a yes/no judgment).
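
As a rough illustration of how the two answer types relate, the sketch below slices a long and a short answer out of a toy record by token offsets. The field names (`document_text`, `annotations`, `long_answer`, `short_answers`, `start_token`, `end_token`, `yes_no_answer`) are assumed to follow the simplified NQ format and should be checked against the actual files. The record itself is invented, and treating `end_token` as exclusive is also an assumption.

```python
# Toy record; field names are assumed to mirror the simplified NQ format.
example = {
    "question_text": "who wrote the origin of species",
    "document_text": (
        "<P> On the Origin of Species was written by "
        "Charles Darwin in 1859 . </P>"
    ),
    "annotations": [
        {
            "long_answer": {"start_token": 0, "end_token": 15},
            "short_answers": [{"start_token": 9, "end_token": 11}],
            "yes_no_answer": "NONE",
        }
    ],
}


def span_text(tokens, span):
    """Join the tokens in the half-open range [start_token, end_token)."""
    return " ".join(tokens[span["start_token"]:span["end_token"]])


# The simplified text is assumed to be whitespace-separated tokens,
# with HTML tags kept as tokens of their own.
tokens = example["document_text"].split(" ")
annotation = example["annotations"][0]

long_answer = span_text(tokens, annotation["long_answer"])
short_answers = [span_text(tokens, s) for s in annotation["short_answers"]]

print(long_answer)
print(short_answers)
```

The short answer always sits inside the long answer's span, which is why the long answer span here covers the whole paragraph element, tags included.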

The interesting bit

The dataset keeps the raw HTML, not just extracted text, so models can in principle exploit document structure such as tables and lists. The two tasks also differ in difficulty: finding the right paragraph or table (long answer, human F1 ~87%) is easier for annotators to agree on than pinpointing the exact words (short answer, human F1 ~76%).
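
One way a model or analysis script could use that structure is to look at the HTML tag that opens each candidate span. The sketch below is a minimal example under the assumption that tags survive as standalone tokens (such as `<P>` or `<Table>`). The `TAG_TO_TYPE` mapping, the `candidate_type` helper, and the toy token sequence are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from an opening HTML token to a coarse element type.
TAG_TO_TYPE = {
    "<p>": "paragraph",
    "<table>": "table",
    "<ul>": "list",
    "<ol>": "list",
    "<dl>": "list",
}


def candidate_type(tokens, start_token):
    """Classify a candidate by the tag token at the start of its span."""
    return TAG_TO_TYPE.get(tokens[start_token].lower(), "other")


# Invented token sequence containing a paragraph, a table, and a list.
tokens = (
    "<P> Some text . </P> "
    "<Table> <Tr> <Td> 1 </Td> </Tr> </Table> "
    "<Ul> <Li> a </Li> </Ul>"
).split()

candidate_starts = [0, 5, 12]  # start_token of each toy candidate
type_counts = Counter(candidate_type(tokens, s) for s in candidate_starts)
print(type_counts)
```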

Key highlights

  • 307K training examples, 7.8K dev, 7.8K hidden test
  • Long answers are HTML elements: 72.9% paragraphs, 19% tables
  • Short answers are token spans; 90% are single spans, but some are multi-span lists
  • Includes a simplified text-extracted version plus raw HTML with byte-level offsets
  • Evaluation accepts predictions as either token offsets or raw byte offsets into the original HTML
  • Docker-based submission required for the hidden test set leaderboard
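
The relationship between token offsets and byte offsets can be sketched as follows. This assumes each document token carries `start_byte` and `end_byte` fields pointing into the UTF-8 encoded HTML, as the original (non-simplified) format is described. The `tokenize` helper and the sample HTML are made up so the example is self-contained; the real data ships its own tokenization.

```python
# Invented HTML; the non-ASCII "é" makes byte and character offsets differ.
html = "<p>Café au lait is coffee with hot milk.</p>".encode("utf-8")
words = ["<p>", "Café", "au", "lait", "is", "coffee",
         "with", "hot", "milk", ".", "</p>"]


def tokenize(html_bytes, words):
    """Hypothetical helper: locate each word and record its byte range."""
    tokens, cursor = [], 0
    for word in words:
        encoded = word.encode("utf-8")
        start = html_bytes.index(encoded, cursor)
        end = start + len(encoded)
        tokens.append({"token": word, "start_byte": start, "end_byte": end})
        cursor = end
    return tokens


def token_span_to_byte_span(tokens, start_token, end_token):
    """Map a half-open token span to a half-open byte span."""
    return tokens[start_token]["start_byte"], tokens[end_token - 1]["end_byte"]


document_tokens = tokenize(html, words)
start_byte, end_byte = token_span_to_byte_span(document_tokens, 5, 9)
answer = html[start_byte:end_byte].decode("utf-8")
print(start_byte, end_byte, answer)
```

Slicing must happen on the encoded bytes, not on the decoded string, because a multi-byte character earlier in the page shifts every later byte offset.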

Caveats

  • The competition site only provides the original HTML format, not the simplified one
  • The top_level flag on long answer candidates is a convenience, not part of the actual task definition
  • Baseline systems and TensorFlow dataset code live in a separate repository, not this one
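
To illustrate why the `top_level` flag is only a convenience, the sketch below recomputes it from span nesting. It assumes the flag simply marks candidates that are not contained in another candidate, and that candidates use `start_token`, `end_token`, and `top_level` fields. The candidate list is invented and `is_nested` is a hypothetical helper.

```python
# Invented candidates: a table, a row nested inside it, and a paragraph.
candidates = [
    {"start_token": 0, "end_token": 40, "top_level": True},
    {"start_token": 2, "end_token": 12, "top_level": False},
    {"start_token": 40, "end_token": 55, "top_level": True},
]


def is_nested(inner, outer):
    """True if `inner` lies within `outer` and is a different candidate."""
    return (
        inner is not outer
        and outer["start_token"] <= inner["start_token"]
        and inner["end_token"] <= outer["end_token"]
    )


# Option 1: trust the provided flag.
top_level = [c for c in candidates if c["top_level"]]

# Option 2: derive the same information from the spans themselves.
derived_flags = [
    not any(is_nested(c, other) for other in candidates) for c in candidates
]
print(len(top_level), derived_flags)
```

Restricting a model to top-level candidates shrinks the search space, but nested candidates remain valid long answers under the task definition, so filtering them out is a modeling choice rather than a requirement.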

Verdict

Worth grabbing if you’re building or benchmarking extractive QA systems and want real user queries rather than synthetic ones. Skip it if you need conversational multi-turn QA or domain-specific corpora: this is strictly single-turn, Wikipedia-grounded search questions.
