DeepMind's 2015 QA corpus: still useful, still brittle
A dataset generation pipeline that scrapes the Wayback Machine to teach machines reading comprehension — when the URLs haven't rotted away.

What it does
This repo generates ~1 million reading-comprehension question/answer pairs from CNN and Daily Mail articles. You feed it article metadata, it fetches the article text from the Internet Archive’s Wayback Machine, then produces cloze-style questions in which one entity is replaced by a placeholder that the model has to recover. The output is one flat file per question containing the context, question, answer, and entity mapping.
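To make the cloze idea concrete, here is a minimal Python 3 sketch of entity replacement. It is not the repo's code: the function name `make_cloze`, the `@entityN` and `@placeholder` marker strings, and the assumption that the entity list is already known are all illustrative choices.

```python
import re

PLACEHOLDER = "@placeholder"  # marker text is an assumption of this sketch


def make_cloze(context, summary_sentence, entities, answer):
    """Anonymize entities, then hide the answer entity in the summary sentence.

    entities: entity strings assumed to be already extracted from the article.
    answer:   the entity (one of `entities`) the question should ask for.
    Returns (anonymized context, cloze question, answer token, token -> name map).
    """
    # Longest names first, so "New York City" is replaced before "New York".
    ordered = sorted(entities, key=len, reverse=True)
    token_for = {name: "@entity%d" % i for i, name in enumerate(ordered)}

    def anonymize(text):
        for name in ordered:
            text = re.sub(r"\b%s\b" % re.escape(name), token_for[name], text)
        return text

    answer_token = token_for[answer]
    # The trailing \b keeps "@entity1" from matching inside "@entity10".
    question = re.sub(re.escape(answer_token) + r"\b", PLACEHOLDER,
                      anonymize(summary_sentence))
    entity_map = {token: name for name, token in token_for.items()}
    return anonymize(context), question, answer_token, entity_map
```

Because the same mapping is applied to the context and the question, a model cannot answer from world knowledge about the named entity and has to read the passage.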
The interesting bit
The whole pipeline is a bet on web archaeology. The questions are generated heuristically from article summaries rather than written by hand, which is cheap to scale, but the size of the corpus you end up with depends entirely on whether the Wayback Machine still has each news article cached. The README also points to a pre-processed copy hosted at NYU in case your scrape fails, which is a quiet admission that this is fragile.
Key highlights
- Generates the CNN/Daily Mail dataset from Hermann et al., NIPS 2015
- Requires Python 2.7 and a very specific libxml2 2.9.1 (not a typo)
- Daily Mail output is ~1 million small files — SSD strongly preferred
- Includes verification scripts to check test set completeness against expected filenames
- Fallback download link hosted at NYU for when the Wayback Machine is “partially down”
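The completeness check mentioned above boils down to a set difference between expected and generated filenames. A rough Python 3 sketch follows; it is not the repo's verification script, and both the function name `missing_files` and the one-filename-per-line manifest format are assumptions.

```python
import os


def missing_files(output_dir, expected_list_path):
    """Return expected filenames that are absent from output_dir, sorted.

    expected_list_path: text file with one expected filename per line
    (this manifest format is an assumption of the sketch).
    """
    with open(expected_list_path) as handle:
        expected = {line.strip() for line in handle if line.strip()}
    present = set(os.listdir(output_dir))
    return sorted(expected - present)
```

An empty result means every expected question file was generated; anything else is the list of questions lost to failed downloads.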
Caveats
- A Python 2.7 dependency is a genuine obstacle in 2024: the interpreter reached end of life in January 2020
- Some Wayback Machine URLs are simply gone; the script handles missing data gracefully but your corpus shrinks
- The README warns that libxml2 version pinning and system-level dependencies can make installation fussy
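Before committing to a long scrape, it may help to spot-check whether a handful of articles are still archived. The sketch below uses the Internet Archive's public availability endpoint, which the repo itself is not known to use; the helper names are hypothetical, and the response fields read here (`archived_snapshots`, `closest`, `available`, `url`) follow that endpoint's JSON layout as I understand it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(url, timestamp=None):
    """Perform the network request; returns a snapshot URL or None."""
    with urlopen(availability_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Running `check` over a small random sample of article URLs gives a rough estimate of how much the corpus will shrink, without paying for the full download.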
Verdict
Worth a look if you’re reproducing 2015-era reading comprehension baselines or studying how synthetic QA datasets age. Skip it if you want modern, maintained tooling — this is a reference implementation with cobwebs, not a framework.