
codemayq/chinese-chatbot-corpus

A collection of 8 Chinese open-domain chat corpora aggregated and processed into a unified format for chatbot training.

4.2k stars · Python · Data Tooling
Velocity (7d): +1.5 ★/day · Trend: steady

This repository consolidates and standardizes several Chinese dialogue datasets, including Douban multi-turn conversations, TV drama subtitles, PTT forum posts, and Weibo, among others. It provides processing scripts that extract dialogue turns, convert Traditional Chinese to Simplified Chinese, and unify the various source formats into one consistent structure. The primary goal is to spare researchers from tracking down and reformatting each corpus separately before training a Chinese chatbot.
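To make the three processing steps concrete, here is a minimal Python sketch. It is not the repository's code. The function names (`to_simplified`, `turns_to_pairs`, `to_tsv`), the four-character conversion table, and the tab-separated query/response layout are all assumptions chosen for illustration; a real pipeline would likely rely on a full Traditional-to-Simplified converter rather than a hand-written table.

```python
# Illustrative only: a toy Traditional -> Simplified table with four pairs.
# A real pipeline would use a complete converter instead of this mapping.
T2S = str.maketrans({"們": "们", "麼": "么", "嗎": "吗", "個": "个"})


def to_simplified(text: str) -> str:
    """Replace the Traditional characters listed in T2S with Simplified ones."""
    return text.translate(T2S)


def turns_to_pairs(turns: list[str]) -> list[tuple[str, str]]:
    """Split an ordered multi-turn dialogue into adjacent (query, response) pairs.

    Blank turns are dropped and surrounding whitespace is trimmed first.
    """
    cleaned = [to_simplified(t.strip()) for t in turns if t.strip()]
    return list(zip(cleaned, cleaned[1:]))


def to_tsv(pairs: list[tuple[str, str]]) -> str:
    """Serialize pairs as one 'query<TAB>response' line each (assumed layout)."""
    return "\n".join(f"{q}\t{r}" for q, r in pairs)


# Hypothetical raw dialogue with mixed script, padding, and an empty turn.
dialogue = ["你們好嗎", "  我很好  ", "", "那個是什麼"]
pairs = turns_to_pairs(dialogue)
print(to_tsv(pairs))
```

Pairing each turn with the one that follows it is one common way to flatten multi-turn data into single-turn training examples; whether the repository does exactly this for every source is not stated here.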
