google-deepmind/rc-data

DeepMind's 2015 QA corpus: still useful, still brittle

A dataset generation pipeline that scrapes the Wayback Machine to teach machines reading comprehension — when the URLs haven't rotted away.

1.3k stars · Python · Data Tooling · Language Models
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

This repo generates ~1 million reading-comprehension question/answer pairs from CNN and Daily Mail articles. You feed it article metadata; it fetches the article text from the Internet Archive’s Wayback Machine, then produces cloze-style questions in which one entity is replaced by a placeholder. The output is one flat file per question, containing the context, question, answer, and entity mapping.
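To make the output format concrete, here is a minimal Python 3 sketch of a reader for one question file. It is not code from the repo. It assumes five blank-line-separated blocks (source URL, context, question, answer, entity mapping); the leading URL block, the `@entityN:Name` mapping syntax, and the names `ClozeQuestion` and `parse_question` are assumptions made for illustration, and the sample content is invented.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClozeQuestion:
    url: str
    context: str
    question: str
    answer: str
    entities: Dict[str, str]


def parse_question(text: str) -> ClozeQuestion:
    """Parse one question file, assuming five blank-line-separated blocks."""
    blocks = [b.strip() for b in text.strip().split("\n\n")]
    if len(blocks) < 5:
        raise ValueError(f"expected 5 blocks, found {len(blocks)}")
    url, context, question, answer = blocks[:4]
    entities: Dict[str, str] = {}
    # Assumed mapping syntax: one "@entityN:Surface form" pair per line.
    for line in "\n".join(blocks[4:]).splitlines():
        if not line.strip():
            continue
        marker, _, name = line.partition(":")
        entities[marker.strip()] = name.strip()
    return ClozeQuestion(url, context, question, answer, entities)


# Invented sample content, shaped like the assumed layout.
SAMPLE = """http://web.archive.org/web/2015/http://example.com/story

@entity0 met @entity1 in @entity2 on monday

@placeholder met @entity1 for talks

@entity0

@entity0:Alice Example
@entity1:Bob Example
@entity2:Paris
"""

q = parse_question(SAMPLE)
print(q.answer, "->", q.entities[q.answer])
```

Resolving the answer marker through the entity mapping is what turns an anonymized answer such as `@entity0` back into a readable name.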

The interesting bit

The whole pipeline is a bet on web archaeology. The questions are generated heuristically from article summaries rather than written by hand, which is cheap to scale, but the corpus you end up with depends entirely on whether the Wayback Machine still holds each archived article. The README also links to a pre-processed copy hosted at NYU in case your scrape fails, which is a quiet admission that this is fragile.
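The heuristic can be illustrated with a simplified Python 3 sketch: anonymize known entities, then blank out one entity at a time in a summary sentence. This is an illustration of the idea only, not the repo's actual generation code. The function names, the `@entityN` and `@placeholder` markers, and the example sentences are assumptions, and the entity list is supplied by hand rather than detected automatically.

```python
import re
from typing import Dict, List, Tuple

PLACEHOLDER = "@placeholder"


def build_markers(entities: List[str]) -> Dict[str, str]:
    """Assign each entity surface form an anonymous marker."""
    return {name: f"@entity{i}" for i, name in enumerate(entities)}


def anonymize(text: str, markers: Dict[str, str]) -> str:
    """Replace entity names with markers, longest names first.

    Uses word boundaries, so names must start and end with a word character.
    """
    for name in sorted(markers, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", markers[name], text)
    return text


def cloze_questions(summary: str, markers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return one (question, answer) pair per entity found in the summary."""
    anon = anonymize(summary, markers)
    pairs = []
    for marker in markers.values():
        # The trailing \b keeps @entity1 from matching inside @entity10.
        pattern = rf"{re.escape(marker)}\b"
        if re.search(pattern, anon):
            pairs.append((re.sub(pattern, PLACEHOLDER, anon), marker))
    return pairs


markers = build_markers(["Alice Example", "Paris"])
article = "Alice Example flew to Paris on Monday to open a new office."
summary = "Alice Example opens office in Paris"

context = anonymize(article, markers)
questions = cloze_questions(summary, markers)
print(context)
for question, answer in questions:
    print(question, "|", answer)
```

Because every question is derived mechanically from a summary, a single article can yield several questions, one per entity mentioned in the summary.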

Key highlights

  • Generates the CNN/Daily Mail dataset from Hermann et al., NIPS 2015
  • Requires Python 2.7 and a very specific libxml2 2.9.1 (not a typo)
  • Daily Mail output is ~1 million small files — SSD strongly preferred
  • Includes verification scripts to check test set completeness against expected filenames
  • Fallback download link at NYU if Wayback Machine is “partially down”
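The completeness check mentioned in the list above can be approximated in a few lines of Python 3. This is a hedged sketch, not the repo's own verification commands. The function name `check_completeness`, the `.question` file extension, and the demo filenames are hypothetical. It uses `os.scandir` because listing a directory with around a million entries is cheaper that way than building full path objects.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Set, Tuple


def check_completeness(expected: Iterable[str], output_dir: Path) -> Tuple[Set[str], float]:
    """Return the expected filenames that are missing and the coverage ratio."""
    expected_set = {name.strip() for name in expected if name.strip()}
    with os.scandir(output_dir) as entries:
        present = {entry.name for entry in entries if entry.is_file()}
    missing = expected_set - present
    coverage = 1.0 - len(missing) / len(expected_set) if expected_set else 1.0
    return missing, coverage


# Demo on a throwaway directory with hypothetical filenames.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for name in ("a.question", "b.question"):
        (out / name).write_text("stub")
    missing, coverage = check_completeness(
        ["a.question", "b.question", "c.question"], out
    )
    print(sorted(missing), f"{coverage:.1%}")
```

The coverage ratio gives a quick measure of how much of the corpus was lost to missing Wayback Machine snapshots.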

Caveats

  • Python 2.7 reached end of life in January 2020, so the dependency is a genuine obstacle in 2024
  • Some Wayback Machine URLs are simply gone; the script handles missing data gracefully but your corpus shrinks
  • The README warns that libxml2 version pinning and system-level dependencies can make installation fussy

Verdict

Worth a look if you’re reproducing 2015-era reading comprehension baselines or studying how synthetic QA datasets age. Skip it if you want modern, maintained tooling — this is a reference implementation with cobwebs, not a framework.
