A junk drawer of Chinese chatbot training data
Curated links to movie subtitles, SMS dumps, and forum Q&A for anyone building Chinese-language conversational AI on a budget.

What it does This repo is essentially a bookmark collection: links to eight external datasets for training Chinese (and some English) chatbots, ranging from movie dialogue and SMS messages to insurance Q&A pairs. The author gathered sources that others would otherwise have to hunt down individually. A few small files are mirrored directly; most entries point elsewhere.
The interesting bit The value is in the curation, not the creation. The author flags quality honestly: movie subtitles are “noisy” with mismatched Q&A pairs, while ChatterBot’s corpus is “small but high quality.” That frankness saves you from downloading garbage. The insurance QA dataset is the most structured, with explicit train/test/validation splits and a 1:10 positive-to-negative ratio.
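If you do grab a labeled Q&A set like the insurance one, it is worth confirming the advertised class balance before training. The sketch below is a minimal check, assuming the data has already been loaded as (question, answer, label) triples with label 1 for a correct answer and 0 for a wrong one. That layout and the name `label_ratio` are illustrative assumptions, not something the repo documents.

```python
from collections import Counter
from fractions import Fraction


def label_ratio(pairs):
    """Return positives per negative as an exact fraction.

    `pairs` is an iterable of (question, answer, label) triples, where
    label is 1 for a correct answer and 0 for a wrong one (an assumed
    layout, chosen for illustration).
    """
    counts = Counter(label for _, _, label in pairs)
    if counts[0] == 0:
        raise ValueError("no negative examples found")
    return Fraction(counts[1], counts[0])


# Toy data: one correct answer and ten wrong ones for the same question.
sample = [("q1", "right answer", 1)]
sample += [("q1", f"wrong answer {i}", 0) for i in range(10)]

ratio = label_ratio(sample)
print(f"positive:negative = {ratio.numerator}:{ratio.denominator}")
```

A ratio far from what the dataset notes claim usually means a split was loaded incorrectly or rows were dropped during parsing.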
Key highlights
- 8 dataset categories spanning subtitles, SMS, lyrics, tweets, and domain-specific Q&A
- Insurance QA: ~13K questions, 142K training pairs, pre-split for benchmarking
- Egret forum corpus: 2,907 human-reviewed Q&A pairs with “best answer” labels
- Xiaohuangji corpus: 500K pairs, pre-tokenized and raw versions both available
- Explicit notes on data quality (noisy subtitles, small but clean ChatterBot set)
Caveats
- Mostly outbound links; several sources could rot or move
- No code, no preprocessing scripts, no unified format—just pointers and occasional ZIP backups
- “Unpublished corpora” section is aspirational (e.g., Microsoft XiaoIce) with no actual data
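Since the repo ships no unified format, anyone combining several of these corpora will end up writing a small normalizer. The following is a rough sketch of one way to do that, assuming each source has already been reduced to tab-separated question/answer lines. The `Pair` record, the function names, and the JSONL output are all hypothetical choices for illustration; the actual files linked from the repo may need their own parsers first.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class Pair:
    source: str    # which corpus the pair came from
    question: str
    answer: str


def from_tsv_lines(lines: Iterable[str], source: str) -> Iterator[Pair]:
    """Yield Pair records from 'question<TAB>answer' lines.

    Blank lines and lines without exactly two non-empty fields are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        question, answer = (p.strip() for p in parts)
        if question and answer:
            yield Pair(source, question, answer)


def to_jsonl(pairs: Iterable[Pair]) -> str:
    """Serialize pairs as JSON Lines, keeping Chinese text readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


# Toy input: one valid pair, one blank line, one malformed line.
demo_lines = ["你好\t你好呀\n", "\n", "no tab on this line\n"]
print(to_jsonl(from_tsv_lines(demo_lines, "demo")))
```

Keeping a `source` field on every record makes it easy to drop or down-weight the noisier corpora later without re-running the conversion.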
Verdict Worth a star if you’re starting Chinese NLP and need a map of where the free data lives. Skip it if you expect downloadable, cleaned, ready-to-tensor datasets—this is a signpost, not a pipeline.