CLUEbenchmark/CLUECorpus2020

100GB of Chinese text, scrubbed from the web's attic

A cleaned, ready-to-train corpus for Chinese NLP that ships with a smaller, simplified-Chinese-only vocabulary.

What it does

CLUECorpus2020 is a 100GB Chinese pre-training dataset extracted from Common Crawl and run through a cleaning pipeline. It also offers a 14GB "small" variant built from news, web community text, Wikipedia, and reviews, already split into 4MB text files that are convenient for pre-training. The project includes a custom 8,021-token vocabulary with the traditional Chinese, Japanese, Korean, and emoji tokens removed.
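To make the vocabulary trimming concrete, here is a minimal sketch of how Japanese, Korean, and emoji tokens could be filtered out of a token list. It is not the project's actual procedure: the helper names `is_unwanted_char` and `filter_vocab`, the emoji code point ranges, and the sample tokens are all assumptions made for illustration. Traditional Chinese is left out on purpose, since separating it from simplified Chinese needs a character mapping table rather than a script check.

```python
import unicodedata

# Assumption: these ranges and name prefixes are an illustrative choice,
# not the rules CLUECorpus2020 actually applied.
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]
SCRIPT_PREFIXES = ("HIRAGANA", "KATAKANA", "HANGUL")


def is_unwanted_char(ch: str) -> bool:
    """Return True for Japanese kana, Korean Hangul, or emoji code points."""
    code = ord(ch)
    if any(lo <= code <= hi for lo, hi in EMOJI_RANGES):
        return True
    # The "" default avoids a ValueError for code points without a name.
    return unicodedata.name(ch, "").startswith(SCRIPT_PREFIXES)


def filter_vocab(tokens):
    """Keep only tokens that contain none of the unwanted characters."""
    return [t for t in tokens if not any(is_unwanted_char(c) for c in t)]


# Hypothetical sample tokens in BERT vocab style.
sample = ["[CLS]", "中", "##文", "の", "カ", "한", "😀", "the"]
kept = filter_vocab(sample)
print(kept)
```

On the sample above, the kana, Hangul, and emoji entries are dropped while the special token, the Chinese characters, and the English word survive.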

The interesting bit

The vocabulary shrink is the quiet win. By dropping non-simplified characters and trimming the English token count, the authors cut the total vocabulary by 62% compared to Google's original, down to 8,021 tokens. The README shows benchmark results suggesting this leaner vocab can match or edge out the full Google set on tasks like CMNLI, at least at smaller data scales.
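The 62% figure can be sanity-checked with a few lines of arithmetic. The sketch below uses the token counts quoted on this page; the one outside number is `GOOGLE_TOTAL`, which assumes Google's Chinese BERT vocabulary has 21,128 entries, a figure this page does not state.

```python
# Token counts quoted on this page.
CLUE_TOTAL = 8_021
CLUE_SIMPLIFIED = 5_689
GOOGLE_SIMPLIFIED = 11_378

# Assumption: size of Google's Chinese BERT vocabulary (not stated on this page).
GOOGLE_TOTAL = 21_128


def reduction(before: int, after: int) -> float:
    """Fractional shrink when going from `before` tokens to `after` tokens."""
    return 1 - after / before


total_cut = reduction(GOOGLE_TOTAL, CLUE_TOTAL)
simplified_cut = reduction(GOOGLE_SIMPLIFIED, CLUE_SIMPLIFIED)

print(f"total vocabulary cut: {total_cut:.0%}")
print(f"simplified Chinese cut: {simplified_cut:.0%}")
```

Under that assumption the total cut comes out at roughly 62%, and the simplified Chinese token count is exactly halved.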

Key highlights

  • 100GB main corpus via email application; 14GB small corpus with direct Baidu Pan links
  • Pre-formatted for BERT-style training: one sentence per line, blank lines between documents
  • Custom vocab: 5,689 simplified Chinese tokens vs. Google’s 11,378
  • Small corpus sources: news (8GB), web community text (3GB), Wikipedia (1.1GB), reviews (2.3GB)
  • Backed by a 2020 arXiv paper and linked to companion pretrained model releases
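The one-sentence-per-line, blank-line-between-documents layout is easy to consume. As a minimal sketch, assuming UTF-8 text and that layout exactly, a reader could look like this; `iter_documents` and the sample text are hypothetical and not part of the project.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of sentences.

    Assumes one sentence per line and one or more blank lines
    between documents.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            sentences.append(line)
        elif sentences:
            # A blank line closes the current document.
            yield sentences
            sentences = []
    if sentences:
        # Last document when the file does not end with a blank line.
        yield sentences


# Hypothetical sample: two documents, the first with two sentences.
sample = "今天天气很好。\n我们去公园。\n\n这是第二篇文档。\n"
docs = list(iter_documents(sample.splitlines()))
print(len(docs), [len(d) for d in docs])
```

For a real corpus file, the same function accepts an open file handle, for example `iter_documents(open(path, encoding="utf-8"))`, where `path` points at one of the downloaded text files.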

Caveats

  • The 100GB corpus requires emailing CLUEbenchmark@163.com with a research proposal; no direct download
  • README benchmarks only cover 1–3GB subsets and 125K–375K steps, not the full 100GB scale
  • Baidu Pan links for the small corpus may require a Chinese phone number to access

Verdict

Worth a look if you're training Chinese language models from scratch and want data that's already been through someone else's cleaning pipeline. Skip it if you need traditional Chinese, multilingual tokens, or instant download without bureaucracy.
