PolyAI-LDN/conversational-datasets

Chatbot training data, but you bake it yourself

Reproducible pipelines for turning Reddit, movie subtitles, and Amazon Q&A into conversational datasets at scale.

1.4k stars · Python · Data Tooling · Language Models
Velocity (7d): +0.5 ★/day · Trend: steady

What it does

This repo provides Apache Beam pipelines that process raw sources (3.7 billion Reddit comments, 400 million subtitle lines, 3.6 million Amazon Q&A pairs) into structured conversational datasets for training response-selection models. You run the scripts on Google Dataflow yourself rather than downloading a zip file.

The interesting bit

The authors don’t hand you the data; they hand you the recipe. Deterministic train/test splits mean two researchers running the same pipeline get the same train and test sets, which is rare in an era where pre-training datasets are black boxes. The “1-of-100” ranking metric asks the model to pick the true response out of a batch of 100, treating the other 99 responses in the batch as random negatives. It is a pragmatic hack that became a community standard.
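
The 1-of-100 idea can be sketched in a few lines of Python 3. This is a minimal illustration, not the repository's code: the function name `one_of_n_accuracy` and the `scores` layout are assumptions, where `scores[i][j]` is a model's score for response j given context i, so the true pairs sit on the diagonal.

```python
def one_of_n_accuracy(scores):
    """Fraction of contexts whose true response gets the top score.

    scores is a square list of lists (hypothetical layout):
    scores[i][j] is the score of response j for context i.
    Every off-diagonal response in the batch acts as a random negative.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must contain at least one row")
    correct = 0
    for i, row in enumerate(scores):
        if len(row) != n:
            raise ValueError("scores must be square")
        # max() returns the first index on ties, so a tie with an
        # earlier negative counts as a miss.
        best = max(range(n), key=lambda j: row[j])
        if best == i:
            correct += 1
    return correct / n


# Toy batch of 3 instead of 100: contexts 0 and 1 rank their own
# response first, context 2 prefers response 0.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
toy_accuracy = one_of_n_accuracy(toy_scores)
```

With a batch of 100 the same function yields the 1-of-100 number; no separate negative sampling step is needed because the batch itself supplies the distractors.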

Key highlights

  • Reddit dataset: 654M training examples after aggressive filtering (no [deleted], no orphans, no novels)
  • Standard format with reverse-indexed context history (context/0, context/1…) so mixed-length conversations need no padding
  • Outputs JSON lines or TensorFlow Records; includes utilities for inspecting TFRecord contents
  • Baseline implementations and benchmark results included
  • Multi-language support via OpenSubtitles
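
The reverse-indexed context layout mentioned above can be illustrated with a short Python 3 sketch. The helper name `to_example` is hypothetical and this is not the repository's code; it assumes the keys `context`, `response`, and `context/N`, with `context/0` being the turn immediately before `context` and higher numbers going further back in time.

```python
def to_example(turns):
    """Turn an ordered list of utterances into one flat example dict.

    The last utterance is the response, the one before it is the
    context, and any earlier turns are numbered backwards in time.
    """
    if len(turns) < 2:
        raise ValueError("need at least a context and a response")
    *history, context, response = turns
    example = {"context": context, "response": response}
    # reversed() makes index 0 the most recent earlier turn.
    for i, text in enumerate(reversed(history)):
        example["context/%d" % i] = text
    return example


example = to_example([
    "Hello, how are you?",
    "I am fine. And you?",
    "Great. What do you think of the weather?",
    "It doesn't feel like February.",
])
```

Because extra turns are separate optional keys rather than a fixed-length list, a two-turn exchange and a ten-turn thread share the same schema without padding.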

Caveats

  • Requires Python 2.7 (explicitly; this is legacy code)
  • Google Cloud dependency: Dataflow, Cloud Storage, service account setup—not a laptop project
  • The README notes that the shuffling of examples among files isn’t reproducible; only the train/test split is
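
One common way to get a split that is reproducible even when shuffling is not is to hash a stable key for each example. The Python 3 sketch below shows that general technique under stated assumptions: `assign_split`, the key format, and the 10% test share are all hypothetical, and the repository's actual splitting logic may differ.

```python
import hashlib


def assign_split(key, test_percent=10):
    """Deterministically map a stable key to 'train' or 'test'.

    key is any string that identifies the example across runs
    (for instance a thread or file ID; the choice is an assumption).
    """
    if not 0 <= test_percent <= 100:
        raise ValueError("test_percent must be between 0 and 100")
    # md5 is used only as a stable hash, not for security. Python's
    # built-in hash() is salted per process, so it would not work here.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_percent else "train"


first_run = [assign_split("thread_%d" % i) for i in range(1000)]
second_run = [assign_split("thread_%d" % i) for i in range(1000)]
```

The assignment depends only on the key, so the order in which a distributed runner processes or writes examples has no effect on which side of the split each one lands.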

Verdict

Worth it if you’re building retrieval-based chatbots and need comparable, citable baselines. Skip if you want drop-in HuggingFace datasets or lack GCP budget.
