Is rc-data open source?

Yes — google-deepmind/rc-data is open source, released under the Apache-2.0 license.

What language is rc-data written in?

google-deepmind/rc-data is primarily written in Python.

How popular is rc-data?

google-deepmind/rc-data has 1.3k stars on GitHub.

Where can I find rc-data?

google-deepmind/rc-data is on GitHub at https://github.com/google-deepmind/rc-data.

google-deepmind/rc-data

DeepMind's 2015 QA corpus: still useful, still brittle

A dataset generation pipeline that scrapes the Wayback Machine to teach machines reading comprehension — when the URLs haven't rotted away.

★ 1.3k stars · Python · Data Tooling · Language Models

What it does

This repo generates roughly 1 million reading-comprehension question/answer pairs from CNN and Daily Mail articles. You feed it article metadata; it fetches the article text from the Internet Archive’s Wayback Machine and produces cloze-style questions in which one entity is replaced by a placeholder. Each question is written to its own flat file containing the context, question, answer, and entity mapping.
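
To make the cloze idea concrete, here is a minimal sketch of how one question could be built from an article and one of its summary sentences. It is written for Python 3 rather than the repo's Python 2.7, and it is not the repo's code: the function names, the `@entityN` and `@placeholder` marker spellings, and the plain string matching used to find entities are all assumptions made for illustration.

```python
import re


def build_mapping(entities):
    """Assign each entity name a marker such as '@entity0' (hypothetical scheme)."""
    # Longest names first, so "New York City" is replaced before "New York".
    ordered = sorted(set(entities), key=lambda name: (-len(name), name))
    return {name: f"@entity{i}" for i, name in enumerate(ordered)}


def anonymize(text, mapping):
    """Replace every whole-word occurrence of each entity name with its marker."""
    for name, marker in mapping.items():
        text = re.sub(rf"\b{re.escape(name)}\b", marker, text)
    return text


def make_cloze(context, highlight, entities, answer):
    """Build one cloze question: hide `answer` in the summary sentence."""
    mapping = build_mapping(entities)
    answer_marker = mapping[answer]
    anon_highlight = anonymize(highlight, mapping)
    # (?!\d) stops '@entity1' from matching the start of '@entity10'.
    question = re.sub(
        rf"{re.escape(answer_marker)}(?!\d)", "@placeholder", anon_highlight, count=1
    )
    return {
        "context": anonymize(context, mapping),
        "question": question,
        "answer": answer_marker,
        "entities": {marker: name for name, marker in mapping.items()},
    }
```

Calling `make_cloze` with an article body, one summary sentence, the list of entity names, and the entity to hide returns the four fields the review describes: context, question, answer, and entity mapping. Replacing names with opaque markers means a model has to read the context to answer, since it cannot fall back on what it already knows about the named entity.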

The interesting bit

The whole pipeline is a bet on web archaeology. The questions are generated heuristically from article summaries rather than hand-written, which is cheap to scale, but the corpus is only complete if the Wayback Machine still serves every archived news article. A pre-processed copy hosted at NYU is also available in case your scrape fails, which is a quiet admission that this is fragile.

Key highlights

Generates the CNN/Daily Mail dataset from Hermann et al., NIPS 2015
Requires Python 2.7 and libxml2 pinned to exactly version 2.9.1 (not a typo)
Daily Mail output is ~1 million small files — SSD strongly preferred
Includes verification scripts to check test set completeness against expected filenames
Fallback download link at NYU if Wayback Machine is “partially down”
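
The completeness check mentioned in the highlights amounts to a set difference between the filenames you expected and the filenames you got. The sketch below shows that idea in Python 3; `find_missing` and its arguments are hypothetical names, and the repo's own verification scripts may work differently.

```python
from pathlib import Path


def find_missing(question_dir, expected_names):
    """Return the expected filenames that are absent from question_dir, sorted."""
    # Collect names once into a set; membership tests stay cheap even with
    # on the order of a million small files in the directory.
    present = {entry.name for entry in Path(question_dir).iterdir() if entry.is_file()}
    return sorted(set(expected_names) - present)
```

An empty result means every expected question file was generated; anything else is a list of files to regenerate or fetch from the fallback copy.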

Caveats

Python 2.7 reached end of life in January 2020, so the dependency is a genuine obstacle
Some Wayback Machine URLs are simply gone; the script handles missing data gracefully but your corpus shrinks
The README warns that libxml2 version pinning and system-level dependencies can make installation fussy
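
The "missing data" caveat can be sketched as a download loop that retries each URL a few times and then records it as missing instead of aborting. This is an illustration of the pattern in Python 3, not the repo's implementation: `fetch_all`, the injected `fetch` callable, and the retry count are assumptions.

```python
def fetch_all(urls, fetch, max_attempts=3):
    """Fetch each URL via `fetch`, skipping any that fail every attempt.

    `fetch` is any callable that returns page text or raises OSError
    (urllib.error.URLError is a subclass of OSError).
    Returns (pages, missing): a dict of url -> text, and a list of failed URLs.
    """
    pages, missing = {}, []
    for url in urls:
        for _ in range(max_attempts):
            try:
                pages[url] = fetch(url)
                break  # success: stop retrying this URL
            except OSError:
                continue  # transient or permanent failure: try again
        else:
            # Every attempt failed, so record the gap and move on.
            missing.append(url)
    return pages, missing
```

The length of `missing` tells you how much the corpus shrank on a given run, which matters when you compare results against numbers reported on the full dataset.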

Verdict

Worth a look if you’re reproducing 2015-era reading comprehension baselines or studying how synthetic QA datasets age. Skip it if you want modern, maintained tooling — this is a reference implementation with cobwebs, not a framework.
