Is Dialog_Corpus open source?

Yes — candlewill/Dialog_Corpus is an open-source project tracked on heatdrop.

What language is Dialog_Corpus written in?

candlewill/Dialog_Corpus is primarily written in Python.

How popular is Dialog_Corpus?

candlewill/Dialog_Corpus has 2.1k stars on GitHub.

Where can I find Dialog_Corpus?

candlewill/Dialog_Corpus is on GitHub at https://github.com/candlewill/Dialog_Corpus.

candlewill/Dialog_Corpus

A junk drawer of Chinese chatbot training data

Curated links to movie subtitles, SMS dumps, and forum Q&A for anyone building Chinese-language conversational AI on a budget.

★ 2.1k stars · Python · Data Tooling · Chat Assistants

What it does

This repo is essentially a bookmark collection: eight external datasets for training Chinese (and some English) chatbots, from movie dialogue and SMS messages to insurance Q&A pairs. The author gathered links others would otherwise hunt down individually. A few small files are mirrored directly; most point elsewhere.

The interesting bit

The value is in the curation, not creation. The author flags quality honestly: movie subtitles are “noisy” with mismatched Q&A pairs, while ChatterBot’s corpus is “small but high quality.” That frankness saves you from downloading garbage. The insurance QA dataset is the most structured, with explicit train/test/validation splits and a 1:10 positive-to-negative ratio.
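
To make the 1:10 ratio concrete, here is a minimal Python sketch of how such pairs could be assembled: each question is paired once with its correct answer and with ten answers drawn at random from the rest of the pool. The function name `build_pairs`, the toy data, and the random sampling strategy are assumptions for illustration; the dataset's own construction may differ.

```python
import random


def build_pairs(qa, answer_pool, negatives_per_positive=10, seed=0):
    """Return (question, answer, label) triples with a fixed negative ratio.

    qa: dict mapping each question to its correct answer (hypothetical layout).
    answer_pool: list of all candidate answers to sample negatives from.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pairs = []
    for question, correct in qa.items():
        pairs.append((question, correct, 1))  # one positive pair
        # Negatives are any pool answers other than the correct one.
        wrong = [a for a in answer_pool if a != correct]
        k = min(negatives_per_positive, len(wrong))
        for answer in rng.sample(wrong, k):
            pairs.append((question, answer, 0))
    return pairs


# Toy data, invented for this example only.
pool = [f"answer {i}" for i in range(20)]
qa = {"question A": "answer 0", "question B": "answer 1"}
pairs = build_pairs(qa, pool)
positives = sum(label for _, _, label in pairs)
negatives = len(pairs) - positives
print(positives, negatives)  # 2 positives, 20 negatives
```

Random negatives are the simplest choice; harder negatives (answers that look similar to the correct one) are a common refinement but are outside this sketch.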

Key highlights

8 dataset categories spanning subtitles, SMS, lyrics, tweets, and domain-specific Q&A
Insurance QA: ~13K questions, 142K training pairs, pre-split for benchmarking
Egret forum corpus: 2,907 human-reviewed Q&A pairs with “best answer” labels
Xiaohuangji corpus: 500K pairs, pre-tokenized and raw versions both available
Explicit notes on data quality (noisy subtitles, small but clean ChatterBot set)

Caveats

Mostly outbound links; several sources could rot or move
No code, no preprocessing scripts, no unified format; just pointers and occasional ZIP backups
“Unpublished corpora” section is aspirational (e.g., Microsoft XiaoIce) with no actual data
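
Because the linked corpora arrive in different layouts, a likely first step is to convert each one into a common list of (question, answer) pairs. The sketch below shows one way to do that. The two input layouts (tab-separated lines and alternating question/answer lines), the sample strings, and the function names are hypothetical and are not taken from the repository.

```python
def pairs_from_tsv(lines):
    """Parse lines shaped like 'question<TAB>answer' (assumed layout)."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        question, answer = line.split("\t", 1)
        if question.strip() and answer.strip():
            pairs.append((question.strip(), answer.strip()))
    return pairs


def pairs_from_alternating(lines):
    """Parse text where questions and answers alternate line by line."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # Pair even positions with odd positions; a trailing unpaired line is dropped.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Invented samples standing in for two differently formatted sources.
tsv_sample = ["你好\t你好呀\n", "broken line\n", "吃了吗\t吃过了\n"]
alt_sample = ["Hello\n", "Hi there\n", "\n", "How are you?\n", "Fine\n", "dangling\n"]

unified = pairs_from_tsv(tsv_sample) + pairs_from_alternating(alt_sample)
for question, answer in unified:
    print(f"{question} -> {answer}")
```

Each real source would need its own small parser after inspecting the downloaded files; the point is only that everything ends up in one shape before training.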

Verdict

Worth a star if you’re starting Chinese NLP and need a map of where the free data lives. Skip it if you expect downloadable, cleaned, ready-to-tensor datasets; this is a signpost, not a pipeline.

Frequently asked

What is candlewill/Dialog_Corpus?

Curated links to movie subtitles, SMS dumps, and forum Q&A for anyone building Chinese-language conversational AI on a budget.