Is CDial-GPT open source?

Yes — thu-coai/CDial-GPT is open source, released under the MIT license.

What language is CDial-GPT written in?

thu-coai/CDial-GPT is primarily written in Python.

How popular is CDial-GPT?

thu-coai/CDial-GPT has 2k stars on GitHub.

Where can I find CDial-GPT?

thu-coai/CDial-GPT is on GitHub at https://github.com/thu-coai/CDial-GPT.

← all repositories

thu-coai/CDial-GPT

Chinese GPT-2 for chitchat, pre-washed and ready to fine-tune

A cleaned-up dataset and pre-trained models for Chinese short-text conversation, built because raw social-media dialogue is too noisy to eat straight.

★2k stars Python Language Models Chat Assistants Data Tooling

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does

CDial-GPT ships two things: the LCCC dataset (millions of Chinese dialogue rounds scraped from Weibo and other sources, then scrubbed for profanity, irrelevant context, and malformed sentences) and a family of 95.5M-parameter GPT/GPT-2 models pre-trained on that data. The code is a fork of HuggingFace’s TransferTransfo, wired for single-GPU or distributed fine-tuning and inference.

The interesting bit

The cleaning pipeline is the real product. The authors fused eight raw corpora, then ran hand-crafted rules plus classifiers to filter noise—producing a “base” variant (stricter, smaller) and a “large” variant (looser, bigger). The models themselves are modest by modern standards, but the two-stage pre-training—first on 1.3B characters of Chinese fiction, then on dialogue—shows an old-school attention to domain adaptation.

Key highlights

LCCC-base: ~6.8M utterances; LCCC-large: ~14.5M utterances, both downloadable via HuggingFace datasets or direct links
Four model checkpoints on HuggingFace Hub, including GPT and GPT-2 variants fine-tuned on each dataset
Supports distributed training out of the box (torch.distributed.launch)
Includes infer.py for batch generation and interact.py for command-line chat
Community contributions: TensorFlow/Keras port, a Dash web UI, and a data-cleaning framework spun off as separate projects

Caveats

README is entirely in Chinese; English speakers will need translation or prior familiarity with the pipeline
Models are 95.5M parameters—compact, but not competitive with modern 7B+ instruction-tuned models for quality
Some documentation links (Baidu Pan, Google Drive) may require workarounds depending on region

Verdict

Worth a look if you need a lightweight, reproducible Chinese dialogue baseline or care about dataset hygiene. Skip it if you want state-of-the-art generative quality or English-language support.

Frequently asked

What is thu-coai/CDial-GPT?: A cleaned-up dataset and pre-trained models for Chinese short-text conversation, built because raw social-media dialogue is too noisy to eat straight.
Is CDial-GPT open source?: Yes — thu-coai/CDial-GPT is open source, released under the MIT license.
What language is CDial-GPT written in?: thu-coai/CDial-GPT is primarily written in Python.
How popular is CDial-GPT?: thu-coai/CDial-GPT has 2k stars on GitHub.
Where can I find CDial-GPT?: thu-coai/CDial-GPT is on GitHub at https://github.com/thu-coai/CDial-GPT.