candlewill/Dialog_Corpus

A junk drawer of Chinese chatbot training data

Curated links to movie subtitles, SMS dumps, and forum Q&A for anyone building Chinese-language conversational AI on a budget.

2.1k stars · Python · Data Tooling · Chat Assistants
Velocity (7d): +0.6 ★ / day · Trend: steady

What it does

This repo is essentially a bookmark collection: eight external datasets for training Chinese (and some English) chatbots, from movie dialogue and SMS messages to insurance Q&A pairs. The author gathered links others would otherwise hunt down individually. A few small files are mirrored directly; most point elsewhere.

The interesting bit

The value is in the curation, not the creation. The author flags quality honestly: movie subtitles are “noisy” with mismatched Q&A pairs, while ChatterBot’s corpus is “small but high quality.” That frankness saves you from downloading garbage. The insurance QA dataset is the most structured, with explicit train/test/validation splits and a 1:10 positive-to-negative ratio.
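
To make the 1:10 positive-to-negative ratio concrete, here is a minimal Python sketch of how such labeled pairs could be assembled. It is an illustration only, not the insurance QA dataset's own construction method: the function name `build_labeled_pairs`, its parameters, and the toy answer pool are all hypothetical.

```python
import random

def build_labeled_pairs(question, positives, answer_pool, neg_per_pos=10, seed=0):
    """Return (question, answer, label) triples with neg_per_pos negatives per positive.

    Hypothetical helper for illustration; not taken from the repo or the dataset.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    # Negative candidates: every pooled answer that is not marked correct.
    candidates = [a for a in answer_pool if a not in positive_set]
    pairs = []
    for pos in positives:
        pairs.append((question, pos, 1))
        # Sample without replacement, capped at the number of available candidates.
        k = min(neg_per_pos, len(candidates))
        for neg in rng.sample(candidates, k):
            pairs.append((question, neg, 0))
    return pairs

# Toy data: 50 placeholder answers, one of which is treated as correct.
pool = [f"answer {i}" for i in range(50)]
pairs = build_labeled_pairs("Does my policy cover flood damage?", ["answer 3"], pool)
```

Each correct answer ends up alongside ten sampled wrong ones, so a model trained on such pairs sees far more negatives than positives. That imbalance is worth keeping in mind when reading accuracy numbers.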

Key highlights

  • 8 dataset categories spanning subtitles, SMS, lyrics, tweets, and domain-specific Q&A
  • Insurance QA: ~13K questions, 142K training pairs, pre-split for benchmarking
  • Egret forum corpus: 2,907 human-reviewed Q&A pairs with “best answer” labels
  • Xiaohuangji corpus: 500K pairs, pre-tokenized and raw versions both available
  • Explicit notes on data quality (noisy subtitles, small but clean ChatterBot set)

Caveats

  • Mostly outbound links; several sources could rot or move
  • No code, no preprocessing scripts, no unified format: just pointers and occasional ZIP backups
  • “Unpublished corpora” section is aspirational (e.g., Microsoft XiaoIce) with no actual data
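
Because the repo ships no unified format, anyone combining these sources has to write that glue themselves. The sketch below shows one possible shape for it in Python: question/answer tuples normalized into JSON Lines. The function name `to_jsonl`, the field names `source`, `question`, and `answer`, and the sample pairs are assumptions made for illustration; none of them come from the repo.

```python
import json

def to_jsonl(pairs, source):
    """Serialize (question, answer) tuples as one JSON object per line.

    Hypothetical helper; the record schema is an assumption, not the repo's format.
    """
    lines = []
    for question, answer in pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            continue  # drop pairs where either side is empty
        record = {"source": source, "question": question, "answer": answer}
        # ensure_ascii=False keeps Chinese text readable instead of \u escapes.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

# Made-up sample pairs, including one with a blank question that gets dropped.
sample = [("你好", "你好呀"), ("  ", "orphan reply"), ("How are you?", "Fine, thanks.")]
jsonl = to_jsonl(sample, source="example")
```

Tagging each record with its source makes it easy to filter out a noisy corpus later without re-running the conversion.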

Verdict

Worth a star if you’re starting Chinese NLP and need a map of where the free data lives. Skip it if you expect downloadable, cleaned, ready-to-tensor datasets: this is a signpost, not a pipeline.
