thu-coai/CDial-GPT

Chinese GPT-2 for chitchat, pre-washed and ready to fine-tune

A cleaned-up dataset and pre-trained models for Chinese short-text conversation, built because raw social-media dialogue is too noisy to eat straight.

Velocity (7d): +0.9 ★/day · Trend: steady

What it does

CDial-GPT ships two things: the LCCC dataset (millions of Chinese dialogue rounds scraped from Weibo and other sources, then scrubbed for profanity, irrelevant context, and malformed sentences) and a family of 95.5M-parameter GPT/GPT-2 models pre-trained on that data. The code is a fork of HuggingFace’s TransferTransfo, wired for single-GPU or distributed fine-tuning and inference.

The interesting bit

The cleaning pipeline is the real product. The authors fused eight raw corpora, then ran hand-crafted rules plus classifiers to filter noise, producing a “base” variant (stricter, smaller) and a “large” variant (looser, bigger). The models themselves are modest by modern standards, but the two-stage pre-training (first on 1.3B characters of Chinese fiction, then on dialogue) shows an old-school attention to domain adaptation.
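To make the rule-based half of that idea concrete, here is a minimal sketch of a dialogue filter. Everything in it is an assumption made for illustration: the phrase list, the length bounds, the regular expressions, and the function names `keep_utterance` and `clean_dialogues` are hypothetical and are not taken from the LCCC pipeline, which also uses trained classifiers that this sketch leaves out.

```python
import re

# Hypothetical rules for illustration only; not the authors' actual rule set.
BLOCKLIST = ("加微信", "点击链接")          # placeholder spam phrases
URL_RE = re.compile(r"https?://\S+")        # any http(s) link
REPEAT_RE = re.compile(r"(.)\1{5,}")        # same character 6+ times in a row
CJK_RE = re.compile(r"[\u4e00-\u9fff]")     # at least one Chinese character


def keep_utterance(text, min_len=2, max_len=100):
    """Return True if a single utterance passes every rule."""
    text = text.strip()
    if not (min_len <= len(text) <= max_len):
        return False
    if URL_RE.search(text) or REPEAT_RE.search(text):
        return False
    if any(phrase in text for phrase in BLOCKLIST):
        return False
    return CJK_RE.search(text) is not None


def clean_dialogues(dialogues):
    """Keep dialogues with at least two utterances, all of which pass the rules."""
    return [
        d for d in dialogues
        if len(d) >= 2 and all(keep_utterance(u) for u in d)
    ]


sample = [
    ["今天天气真好", "是啊出去走走吧"],       # clean
    ["看这个 http://example.com", "好的"],    # contains a URL
    ["哈哈哈哈哈哈哈哈", "笑什么"],           # long character run
    ["加微信领红包", "不用了"],               # blocklisted phrase
    ["你好"],                                 # no reply, not a dialogue
]
cleaned = clean_dialogues(sample)
print(len(cleaned))
```

Dropping a whole dialogue when any one utterance fails is the strict choice; a looser variant could truncate the dialogue at the first bad utterance instead, which is one way a “base” versus “large” style split could arise.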

Key highlights

  • LCCC-base: ~6.8M dialogues; LCCC-large: ~12.0M dialogues, both downloadable via HuggingFace datasets or direct links
  • Four model checkpoints on HuggingFace Hub, covering GPT and GPT-2 variants pre-trained on the LCCC data
  • Supports distributed training out of the box (torch.distributed.launch)
  • Includes infer.py for batch generation and interact.py for command-line chat
  • Community contributions: TensorFlow/Keras port, a Dash web UI, and a data-cleaning framework spun off as separate projects
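Size figures for dialogue corpora can count either whole dialogues or individual utterances, and the two differ by a factor of two or more, so it helps to compute both. The sketch below assumes the data is a JSON array of dialogues, each an array of utterance strings; that layout and the helper name `corpus_stats` are assumptions for illustration, so check the actual files before relying on it.

```python
import json

# Assumed layout: a JSON array of dialogues, each an array of utterance strings.
raw = '[["你好", "你好呀"], ["吃饭了吗", "还没", "一起去吧"], ["晚安", "晚安"]]'
dialogues = json.loads(raw)


def corpus_stats(dialogues):
    """Count dialogues, single-turn vs multi-turn dialogues, and utterances."""
    single = sum(1 for d in dialogues if len(d) == 2)   # one post, one reply
    multi = sum(1 for d in dialogues if len(d) > 2)     # three or more utterances
    utterances = sum(len(d) for d in dialogues)
    return {
        "dialogues": len(dialogues),
        "single_turn": single,
        "multi_turn": multi,
        "utterances": utterances,
    }


stats = corpus_stats(dialogues)
print(stats)
```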

Caveats

  • README is entirely in Chinese; English speakers will need translation or prior familiarity with the pipeline
  • Models have 95.5M parameters: compact, but not competitive on quality with modern 7B+ instruction-tuned models
  • Some documentation links (Baidu Pan, Google Drive) may require workarounds depending on region

Verdict

Worth a look if you need a lightweight, reproducible Chinese dialogue baseline or care about dataset hygiene. Skip it if you want state-of-the-art generative quality or English-language support.
