Is CLUECorpus2020 open source?

Yes — CLUEbenchmark/CLUECorpus2020 is open source, released under the MIT license.

How popular is CLUECorpus2020?

CLUEbenchmark/CLUECorpus2020 has 1k stars on GitHub.

Where can I find CLUECorpus2020?

CLUEbenchmark/CLUECorpus2020 is on GitHub at https://github.com/CLUEbenchmark/CLUECorpus2020.

CLUEbenchmark/CLUECorpus2020

100GB of Chinese text, scrubbed from the web's attic

A cleaned, ready-to-train corpus for Chinese NLP that ships with a smaller, simplified-Chinese-only vocabulary.

★1k stars Data Tooling Language Models

What it does

CLUECorpus2020 is a 100GB Chinese pre-training dataset scraped from Common Crawl and run through a cleaning pipeline. It also offers a 14GB “small” variant built from news, web text, Wikipedia, and comments, already chunked into pre-training-friendly 4MB text files. The project includes a custom 8,021-token vocabulary stripped of traditional Chinese, Japanese, Korean, and emoji tokens.

The interesting bit

The vocabulary shrink is the quiet win. By dropping non-simplified characters and trimming the English token count, the authors cut the total vocabulary by 62% compared to Google’s original Chinese BERT vocabulary. The README shows benchmark results suggesting this leaner vocabulary can match or edge out the full Google set on tasks like CMNLI, at least at smaller data scales.
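As a quick sanity check on the 62% figure, the arithmetic can be sketched as below. The 8,021 total and the simplified Chinese counts (5,689 vs. 11,378) come from this summary; the 21,128 size for Google's original Chinese BERT vocabulary is an assumption that is not stated here, and the helper name `reduction_pct` is purely illustrative.

```python
# Sizes taken from this summary, except where marked as an assumption.
CLUE_VOCAB_SIZE = 8_021
GOOGLE_VOCAB_SIZE = 21_128  # assumption: size of Google's Chinese BERT vocabulary


def reduction_pct(original: int, reduced: int) -> float:
    """Return the percentage by which `reduced` is smaller than `original`."""
    return (1 - reduced / original) * 100


# Overall vocabulary cut, and the cut within simplified Chinese tokens alone.
total_cut = reduction_pct(GOOGLE_VOCAB_SIZE, CLUE_VOCAB_SIZE)
chinese_cut = reduction_pct(11_378, 5_689)

print(f"total vocabulary cut: {total_cut:.0f}%")
print(f"simplified Chinese token cut: {chinese_cut:.0f}%")
```

Under that assumption the totals line up with the quoted 62%, while the simplified Chinese tokens on their own are halved.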

Key highlights

100GB main corpus via email application; 14GB small corpus with direct Baidu Pan links
Pre-formatted for BERT-style training: one sentence per line, blank lines between documents
Custom vocab: 5,689 simplified Chinese tokens vs. Google’s 11,378
Small corpus sources: news (8GB), web community text (3GB), Wikipedia (1.1GB), reviews (2.3GB)
Backed by a 2020 arXiv paper and linked to companion pretrained model releases
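The "one sentence per line, blank lines between documents" layout mentioned above is simple to consume. The following is a minimal sketch of a reader for that layout, not code from the project; the function name `iter_documents` and the sample text are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of sentences.

    Assumes one sentence per line and blank lines between documents.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            sentences.append(line)
        elif sentences:
            # A blank line closes the current document.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by a blank line.
        yield sentences


# Tiny in-memory sample standing in for one corpus file (hypothetical content).
sample = "第一句。\n第二句。\n\n另一篇文档。\n"
docs = list(iter_documents(sample.splitlines()))
print(len(docs), docs)
```

Because the function accepts any iterable of lines, a file opened with `open(path, encoding="utf-8")` can be passed in directly, so a 4MB chunk is streamed rather than loaded whole.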

Caveats

The 100GB corpus requires emailing CLUEbenchmark@163.com with a research proposal; no direct download
README benchmarks only cover 1–3GB subsets and 125K–375K steps, not the full 100GB scale
Baidu Pan links for the small corpus may require a Chinese phone number to access

Verdict

Worth a look if you’re training Chinese language models from scratch and want data that’s already been through someone else’s cleaning pipeline. Skip it if you need traditional Chinese, multilingual tokens, or instant download without bureaucracy.
