Chatbot training data: crowdsourced, YAML-flavored, occasionally wrong
A community-contributed multilingual corpus for bootstrapping ChatterBot when you have nothing else to say.

What it does
ChatterBot Corpus is a collection of user-contributed conversation datasets in YAML format, designed to prime fresh ChatterBot installations with basic dialog across multiple languages. You drop these files into chatterbot_corpus/data/, point your bot at them, and get something that can respond to “Hello” without embarrassing silence.
The interesting bit
The project treats training data as plain-text infrastructure — no databases, no proprietary formats, just categorized YAML files anyone can edit. The README includes a slightly overwrought Daniel Read quote about unit testing, which feels like a quiet apology for the fact that community-contributed content may contain “occasional mistakes or inaccuracies.”
Key highlights
- Multilingual coverage, though specific language counts aren’t listed
- Simple YAML schema: categories header plus paired conversation lines
- Extensible: add new languages by creating directories and pull requests
- Distributed via PyPI as a companion package to ChatterBot
- Includes basic unittest suite (
python -Wonce -m unittest discover)
Caveats
- Content quality varies; the maintainers explicitly warn of potential errors in user submissions
- Documentation link (
http://corpus.chatterbot.us/) is referenced but not described in detail - No visible versioning or quality metrics for individual language datasets
Verdict
Useful if you’re building with ChatterBot and need starter data faster than you can write it. Skip it if you need guaranteed-accurate, professionally curated dialog — or if you’ve already moved on to retrieval-augmented generation and wonder why you’re reading about YAML chatbot training in 2024.