Is conversational-datasets open source?

Yes — PolyAI-LDN/conversational-datasets is open source, released under the Apache-2.0 license.

What language is conversational-datasets written in?

PolyAI-LDN/conversational-datasets is primarily written in Python.

How popular is conversational-datasets?

PolyAI-LDN/conversational-datasets has 1.4k stars on GitHub.

Where can I find conversational-datasets?

PolyAI-LDN/conversational-datasets is on GitHub at https://github.com/PolyAI-LDN/conversational-datasets.

PolyAI-LDN/conversational-datasets

Chatbot training data, but you bake it yourself

Reproducible pipelines for turning Reddit, movie subtitles, and Amazon Q&A into conversational datasets at scale.

★ 1.4k stars · Python · Data Tooling · Language Models

What it does

This repo provides Apache Beam pipelines that process raw sources (3.7 billion Reddit comments, 400 million subtitle lines, 3.6 million Amazon Q&A pairs) into structured conversational datasets for training response-selection models. You run the scripts on Google Dataflow yourself rather than downloading a zip file.

The interesting bit

The authors don’t hand you the data; they hand you the recipe. Deterministic train/test splits mean two researchers running the same pipeline get identical datasets, which is rare in an era where pre-training datasets are black boxes. The “1-of-100” ranking metric asks a model to pick the true response out of 100 candidates, where the other 99 are the responses of its batch-mates used as random negatives. It is a pragmatic hack that became a community standard.
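
To make the 1-of-100 idea concrete, here is a minimal sketch of how such a metric can be computed from a batch of encoded contexts and responses. The function name `one_of_n_accuracy`, the dot-product scoring, and the random vectors are assumptions made for illustration; this is not the repository's own evaluation code.

```python
import numpy as np


def one_of_n_accuracy(context_vecs, response_vecs):
    """Fraction of contexts whose true response outscores every batch-mate.

    Row i of each array is assumed to belong to the same example, so the
    correct candidate for context i sits on the diagonal of the score matrix.
    """
    scores = context_vecs @ response_vecs.T      # (N, N) dot-product scores
    predicted = scores.argmax(axis=1)            # best-scoring candidate per context
    return float((predicted == np.arange(len(scores))).mean())


# Toy batch of 100 examples: each context is a noisy copy of its response,
# standing in for the output of a trained encoder (hypothetical data).
rng = np.random.default_rng(0)
responses = rng.normal(size=(100, 16))
contexts = responses + 0.1 * rng.normal(size=(100, 16))
print(one_of_n_accuracy(contexts, responses))
```

With a batch size of 100, each context is ranked against its own response plus 99 others, which is where the name comes from. No separate negative sampling step is needed because the rest of the batch supplies the negatives.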

Key highlights

Reddit dataset: 654M training examples after aggressive filtering (no [deleted], no orphans, no novels)
Standard format with reverse-indexed context history (context/0, context/1…) so mixed-length conversations need no padding
Outputs JSON lines or TensorFlow Records; includes utilities for inspecting TFRecord contents
Baseline implementations and benchmark results included
Multi-language support via OpenSubtitles
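
As an illustration of the reverse-indexed format, the sketch below rebuilds a conversation in chronological order from a single example. It reads the indexing as "context/0 is the most recent extra turn, context/1 the one before it", and it assumes the latest context and the reply are stored under the keys `context` and `response`. The helper name `conversation_turns` and the sample dialogue are hypothetical.

```python
import re


def conversation_turns(example):
    """Return the turns of one example, oldest first."""
    extra = {}
    for key, text in example.items():
        match = re.fullmatch(r"context/(\d+)", key)
        if match:
            extra[int(match.group(1))] = text
    # Higher index means further back in time, so walk from high to low.
    history = [extra[i] for i in sorted(extra, reverse=True)]
    return history + [example["context"], example["response"]]


# Hypothetical example with two extra turns of history.
example = {
    "context/1": "Anyone tried the new pizza place?",
    "context/0": "Yes, went last week.",
    "context": "Was it any good?",
    "response": "The crust was great, the service less so.",
}
for turn in conversation_turns(example):
    print(turn)
```

Because the index counts backwards from the most recent turn, an example with no extra history simply lacks the `context/i` keys, and examples with different history lengths can share one dataset without padding.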

Caveats

Requires Python 2.7 (explicitly; this is legacy code)
Google Cloud dependency: Dataflow, Cloud Storage, service account setup—not a laptop project
The README notes that the shuffling of examples among output files isn’t reproducible; only the train/test split is
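
One common way to get a split that is reproducible even when shuffling is not is to derive the split from a hash of a stable key. The sketch below shows that general technique; the function `assign_split`, the use of MD5, the key format, and the 10% test fraction are all assumptions for illustration and may differ from what the repository's pipelines do.

```python
import hashlib


def assign_split(key, test_percent=10):
    """Map a stable key (for example a thread ID) to 'train' or 'test'.

    The decision depends only on the key, so it is the same on every run
    and on every machine, regardless of processing order.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100               # stable bucket in [0, 100)
    return "test" if bucket < test_percent else "train"


# Hypothetical keys: the same key always lands in the same split.
for key in ["thread-001", "thread-002", "thread-003"]:
    print(key, assign_split(key))
```

Hashing a conversation-level key rather than an individual example keeps every example from one conversation on the same side of the split, which avoids leaking test conversations into training data.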

Verdict

Worth it if you’re building retrieval-based chatbots and need comparable, citable baselines. Skip if you want drop-in Hugging Face datasets or lack GCP budget.
