Chinese GPT-2 for chitchat, pre-washed and ready to fine-tune
A cleaned-up dataset and pre-trained models for Chinese short-text conversation, built because raw social-media dialogue is too noisy to eat straight.

What it does
CDial-GPT ships two things: the LCCC dataset (millions of Chinese dialogue rounds scraped from Weibo and other sources, then scrubbed for profanity, irrelevant context, and malformed sentences) and a family of 95.5M-parameter GPT/GPT-2 models pre-trained on that data. The code is a fork of HuggingFace’s TransferTransfo, wired for single-GPU or distributed fine-tuning and inference.
The interesting bit
The cleaning pipeline is the real product. The authors fused eight raw corpora, then ran hand-crafted rules plus classifiers to filter noise—producing a “base” variant (stricter, smaller) and a “large” variant (looser, bigger). The models themselves are modest by modern standards, but the two-stage pre-training—first on 1.3B characters of Chinese fiction, then on dialogue—shows an old-school attention to domain adaptation.
Key highlights
- LCCC-base: ~6.8M utterances; LCCC-large: ~14.5M utterances, both downloadable via HuggingFace
datasetsor direct links - Four model checkpoints on HuggingFace Hub, including GPT and GPT-2 variants fine-tuned on each dataset
- Supports distributed training out of the box (
torch.distributed.launch) - Includes
infer.pyfor batch generation andinteract.pyfor command-line chat - Community contributions: TensorFlow/Keras port, a Dash web UI, and a data-cleaning framework spun off as separate projects
Caveats
- README is entirely in Chinese; English speakers will need translation or prior familiarity with the pipeline
- Models are 95.5M parameters—compact, but not competitive with modern 7B+ instruction-tuned models for quality
- Some documentation links (Baidu Pan, Google Drive) may require workarounds depending on region
Verdict
Worth a look if you need a lightweight, reproducible Chinese dialogue baseline or care about dataset hygiene. Skip it if you want state-of-the-art generative quality or English-language support.