Is natural-questions open source?

Yes — google-research-datasets/natural-questions is open source, released under the Apache-2.0 license.

What language is natural-questions written in?

google-research-datasets/natural-questions is primarily written in Python.

How popular is natural-questions?

google-research-datasets/natural-questions has 1.1k stars on GitHub.

Where can I find natural-questions?

google-research-datasets/natural-questions is on GitHub at https://github.com/google-research-datasets/natural-questions.

google-research-datasets/natural-questions

Google's search logs, annotated and served as QA fuel

A dataset of real user questions paired with Wikipedia answers, built to stress-test reading comprehension on actual human curiosity.

★ 1.1k stars · Python · Data Tooling · Language Models

What it does

Natural Questions (NQ) is a Google-produced dataset for training and evaluating question-answering systems. Its training set pairs 307,372 real search queries with entire Wikipedia pages, annotated by humans who identified both long answers (the smallest HTML bounding box containing the answer) and short answers (exact text spans or yes/no judgments).
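The annotation layout described above can be pictured with a small sketch. The class and field names below (`Span`, `Annotation`, `yes_no`, and so on) are hypothetical, chosen for illustration rather than taken from the dataset's schema, and the half-open token range is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Token range [start, end) into a page's token list (half-open by assumption)."""
    start: int
    end: int


@dataclass
class Annotation:
    """One annotator's judgment for a (question, page) pair. Hypothetical layout."""
    long_answer: Optional[Span] = None                        # None: no answer on the page
    short_answers: List[Span] = field(default_factory=list)   # zero, one, or several spans
    yes_no: Optional[bool] = None                             # set only for yes/no questions


def describe(annotation: Annotation) -> str:
    """Summarize which kind of answer an annotation carries."""
    if annotation.long_answer is None:
        return "no answer"
    if annotation.yes_no is not None:
        return "long answer + yes/no"
    if annotation.short_answers:
        return f"long answer + {len(annotation.short_answers)} short span(s)"
    return "long answer only"
```

For example, `describe(Annotation(long_answer=Span(0, 10), short_answers=[Span(2, 4)]))` reports a long answer with one short span, while an empty `Annotation()` reports no answer.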

The interesting bit

The dataset keeps the raw HTML, not just extracted text, so models can in principle exploit document structure like tables and lists. There's also a built-in tension between two tasks: finding the right paragraph or table (long answer, human F1 ~87%) and pinpointing the exact words (short answer, human F1 ~76%).

Key highlights

307K training examples, 7.8K dev, 7.8K hidden test
Long answers are HTML elements: 72.9% paragraphs, 19% tables
Short answers are token spans; 90% are single spans, but some are multi-span lists
Includes a simplified text-extracted version plus raw HTML with byte-level offsets
Evaluation accepts predictions as either token offsets or raw byte offsets into the original HTML
Docker-based submission required for the hidden test set leaderboard
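To make the token-offset versus byte-offset distinction in the highlights concrete, here is a minimal sketch on a toy HTML snippet. The tokenizer, the `Token` fields, and the half-open span convention are assumptions for illustration; the dataset ships its own tokenization, and this is not its loader.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    start_byte: int  # offset into the UTF-8 encoded HTML
    end_byte: int    # exclusive


def tokenize(html_bytes: bytes) -> List[Token]:
    """Toy tokenizer: an HTML tag or a run of non-space characters is one token."""
    pattern = re.compile(rb"<[^>]+>|[^\s<]+")
    return [
        Token(m.group().decode("utf-8"), m.start(), m.end())
        for m in pattern.finditer(html_bytes)
    ]


def token_span_to_byte_span(tokens: List[Token], start_token: int, end_token: int) -> Tuple[int, int]:
    """Convert a half-open token span into the equivalent half-open byte span."""
    return tokens[start_token].start_byte, tokens[end_token - 1].end_byte


def text_from_byte_span(html_bytes: bytes, start_byte: int, end_byte: int) -> str:
    """Slice the raw HTML bytes and decode the selected region."""
    return html_bytes[start_byte:end_byte].decode("utf-8")


html_bytes = "<p>Paris is the capital of France.</p>".encode("utf-8")
tokens = tokenize(html_bytes)

# A short answer expressed as a token span selects the same text as its byte span.
byte_span = token_span_to_byte_span(tokens, 1, 2)
answer = text_from_byte_span(html_bytes, *byte_span)
```

Here the token span `[1, 2)` and the byte span it maps to both pick out `Paris`, which is why an evaluator can accept either form as long as both refer to the same original HTML.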

Caveats

The competition site only provides the original HTML format, not the simplified one
The top_level flag on long answer candidates is a convenience, not part of the actual task definition
Baseline systems and TensorFlow dataset code live in a separate repository, not this one
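One way to read the `top_level` caveat: assuming the flag marks candidates that are not nested inside another candidate, it can be recomputed from the spans alone, which is what makes it a convenience. The sketch below uses made-up candidates, and the `start_token` and `end_token` field names are assumptions.

```python
# Toy long answer candidates; spans are half-open token ranges (an assumption).
candidates = [
    {"start_token": 0, "end_token": 50, "top_level": True},    # e.g. a table
    {"start_token": 5, "end_token": 20, "top_level": False},   # a row nested inside it
    {"start_token": 60, "end_token": 90, "top_level": True},   # a separate paragraph
]


def is_nested(inner: dict, outer: dict) -> bool:
    """True when `inner` lies entirely within a different candidate `outer`."""
    return (
        inner is not outer
        and outer["start_token"] <= inner["start_token"]
        and inner["end_token"] <= outer["end_token"]
    )


def derive_top_level(cands: list) -> list:
    """Recompute the flag: a candidate is top-level if no other candidate contains it."""
    return [not any(is_nested(c, other) for other in cands) for c in cands]


derived = derive_top_level(candidates)
```

On this toy input the derived flags match the stored ones, so a system is free to ignore the flag and score every candidate.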

Verdict

Worth grabbing if you're building or benchmarking extractive QA systems and want real user queries rather than synthetic ones. Skip it if you need conversational multi-turn QA or domain-specific corpora: this is strictly single-turn, Wikipedia-grounded search questions.
