codemayq/chinese-chatbot-corpus
A collection of 8 Chinese open-domain chat corpora aggregated and processed into a unified format for chatbot training.

This repository consolidates and standardizes multiple Chinese dialogue datasets, including Douban multi-turn conversations, TV drama subtitles, PTT forum posts, Weibo, and more. It provides processing scripts that extract dialogue turns, convert Traditional Chinese to Simplified Chinese, and unify the various source formats into a consistent structure. The primary goal is to spare researchers from having to locate and reformat each of these disparate corpora individually before training Chinese chatbots.
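
To make the processing steps concrete, here is a minimal sketch of how a multi-turn dialogue might be turned into unified query/response records. It is not the repository's actual code: the function names (`demo_t2s`, `to_pairs`, `to_tsv_lines`), the tab-separated output layout, and the three-character conversion table are all assumptions made for illustration. A real pipeline would plug in a full Traditional-to-Simplified converter in place of the toy mapping.

```python
from typing import Callable, Iterable, List, Tuple

# Toy mapping for illustration only (hypothetical); NOT a complete
# Traditional-to-Simplified table. A real converter would replace this.
_DEMO_T2S = str.maketrans({"麼": "么", "們": "们", "嗎": "吗"})


def demo_t2s(text: str) -> str:
    """Convert the few Traditional characters covered by the toy table."""
    return text.translate(_DEMO_T2S)


def to_pairs(
    turns: Iterable[str],
    convert: Callable[[str], str] = demo_t2s,
) -> List[Tuple[str, str]]:
    """Turn an ordered list of utterances into adjacent (query, response) pairs.

    Blank turns are dropped, and tabs inside an utterance are replaced with
    spaces so they cannot collide with the assumed tab-separated output.
    """
    cleaned = [
        convert(t.strip().replace("\t", " "))
        for t in turns
        if t.strip()
    ]
    # Pair each utterance with the one that follows it.
    return list(zip(cleaned, cleaned[1:]))


def to_tsv_lines(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Serialize pairs as 'query<TAB>response' lines (assumed unified format)."""
    return [f"{query}\t{response}" for query, response in pairs]


if __name__ == "__main__":
    dialogue = ["你們好嗎", "  ", "很好", "那就好"]
    for line in to_tsv_lines(to_pairs(dialogue)):
        print(line)
```

Passing the converter as a parameter keeps the pairing logic independent of whichever conversion library is used, and single-turn sources can reuse `to_tsv_lines` directly with their own pairs.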