Chatbot training data, but you bake it yourself
Reproducible pipelines for turning Reddit, movie subtitles, and Amazon Q&A into conversational datasets at scale.

What it does
This repo provides Apache Beam pipelines that turn raw sources (3.7 billion Reddit comments, 400 million subtitle lines, 3.6 million Amazon Q&A pairs) into structured conversational datasets for training response-selection models. Instead of downloading a zip file, you run the scripts yourself on Google Dataflow.
The interesting bit
The authors don’t hand you the data; they hand you the recipe. Deterministic train/test splits mean two researchers running the same pipeline get the same split, which is rare in an era where pre-training datasets are black boxes. The “1-of-100” ranking metric, which ranks each example’s true response against the responses of its batch-mates used as random negatives, is a pragmatic hack that became a community standard.
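
To make the 1-of-100 idea concrete, here is a minimal sketch in Python with NumPy. It assumes you already have one vector per context and one per response, and that a dot product is the scoring function; both are assumptions for illustration, and `one_of_n_accuracy` is a hypothetical name, not a function from the repo.

```python
import numpy as np


def one_of_n_accuracy(context_vecs, response_vecs, batch_size=100):
    """Fraction of contexts whose true response outscores its batch-mates.

    context_vecs, response_vecs: arrays of shape [num_examples, dim], where
    row i of each array comes from the same (context, response) pair.
    Dot-product scoring is an assumption made for this sketch.
    """
    # Only full batches are scored; a ragged tail is dropped.
    num_examples = (len(context_vecs) // batch_size) * batch_size
    if num_examples == 0:
        raise ValueError("need at least one full batch of examples")
    correct = 0
    for start in range(0, num_examples, batch_size):
        c = context_vecs[start:start + batch_size]
        r = response_vecs[start:start + batch_size]
        scores = c @ r.T  # scores[i, j]: context i against response j
        # The true response for context i sits on the diagonal; the other
        # batch_size - 1 responses in the batch act as random negatives.
        predicted = np.argmax(scores, axis=1)
        correct += int(np.sum(predicted == np.arange(batch_size)))
    return correct / num_examples
```

Because the negatives come for free from the batch, no separate negative-sampling step is needed, which is what makes the metric cheap to compute on very large test sets.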
Key highlights
- Reddit dataset: 654M training examples after aggressive filtering (no [deleted], no orphans, no novels)
- Standard format with reverse-indexed context history (context/0, context/1…) so mixed-length conversations need no padding
- Outputs JSON lines or TensorFlow Records; includes utilities for inspecting TFRecord contents
- Baseline implementations and benchmark results included
- Multi-language support via OpenSubtitles
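
The reverse-indexed context layout mentioned in the highlights can be unpacked with a few lines of Python 3. This is a sketch under assumptions: it supposes the most recent turn is stored under `context` and the reply under `response`, with `context/0` one turn further back, `context/1` two turns back, and so on. The helper name `conversation_turns` and the sample line are invented for illustration.

```python
import json
import re


def conversation_turns(example):
    """Return the turns of one example in chronological order.

    Assumed layout: `context` is the turn right before `response`,
    `context/0` is the turn before that, `context/1` the one before that.
    """
    extra = {}
    for key, value in example.items():
        match = re.fullmatch(r"context/(\d+)", key)
        if match:
            extra[int(match.group(1))] = value
    # Higher indices are further back in time, so walk them in reverse.
    history = [extra[i] for i in sorted(extra, reverse=True)]
    return history + [example["context"], example["response"]]


# An invented JSON line in that layout (not taken from any of the datasets).
line = json.dumps({
    "context/1": "Did you watch the match?",
    "context/0": "No, I missed it.",
    "context": "It went to penalties.",
    "response": "Who won in the end?",
})
turns = conversation_turns(json.loads(line))
```

Indexing backwards from the response means a two-turn and a ten-turn conversation share the same schema: the short one simply has fewer `context/N` keys, so nothing needs padding.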
Caveats
- Requires Python 2.7 (explicitly; this is legacy code)
- Google Cloud dependency: Dataflow, Cloud Storage, service account setup—not a laptop project
- The README notes that the way examples are shuffled across output files isn’t reproducible; only the train/test split is
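
A common way to get a train/test split that comes out the same on every run is to derive it from a hash of a stable key instead of a random number generator. The sketch below illustrates that general technique only; the key (for example a thread ID), the 10% test share, and the name `assign_split` are all assumptions, and the repo's own pipelines may key and hash differently.

```python
import hashlib


def assign_split(key, test_percent=10):
    """Map a stable string key to "train" or "test", identically on every run.

    `key` and `test_percent` are hypothetical names for this sketch. The
    built-in hash() is avoided because Python 3 salts string hashes per
    process by default, which would change the split between runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "test" if bucket < test_percent else "train"
```

If every example from one conversation shares the same key, the whole conversation lands on one side of the split, which avoids leaking near-duplicate contexts from train into test.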
Verdict
Worth it if you’re building retrieval-based chatbots and need comparable, citable baselines. Skip it if you want drop-in Hugging Face datasets or lack a GCP budget.