Is CLUEPretrainedModels open source?

Yes — CLUEbenchmark/CLUEPretrainedModels is an open-source project tracked on heatdrop.

What language is CLUEPretrainedModels written in?

CLUEbenchmark/CLUEPretrainedModels is primarily written in Python.

How popular is CLUEPretrainedModels?

CLUEbenchmark/CLUEPretrainedModels has 810 stars on GitHub.

Where can I find CLUEPretrainedModels?

CLUEbenchmark/CLUEPretrainedModels is on GitHub at https://github.com/CLUEbenchmark/CLUEPretrainedModels.


CLUEbenchmark/CLUEPretrainedModels

A Chinese model buffet: large, 8× fast tiny, and pair variants

It trains Chinese RoBERTa models on 35 billion characters of Common Crawl and ships them in three variants: large, tiny (about 8× faster to train), and sentence-pair-optimized.

★810 stars Python Language Models Data Tooling




What it does

CLUEPretrainedModels releases a family of Chinese RoBERTa checkpoints built on the organization’s own 100 GB Common-Crawl corpus (35 billion characters) and a custom 8,000-token vocabulary that is roughly one-third the size of Google’s Chinese BERT vocab. The repo provides large weights that edge out RoBERTa-wwm-large on CLUE classification benchmarks, tiny weights down to 7.5 million parameters that train roughly eight times faster than BERT-base, and specialized “pair” variants tuned for semantic-similarity tasks. Every checkpoint keeps the standard BERT architecture, so you can load it with the usual Hugging Face Transformers classes.

The interesting bit

The real savings come from the vocabulary diet. Trimming the vocab to 8K shrinks the token table by 62 percent and, per the project’s own speed benchmarks, accelerates BERT-base training by about 15 percent before you even shrink model depth. The tiny variants then slash total parameters by 92.6 percent and push the speedup to 8×, while still outperforming ALBERT-tiny on the tested CLUE tasks.
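
The percentages above can be sanity-checked with a little arithmetic. The sketch below relies on reference figures that this page does not state and that are therefore assumptions: 21,128 entries for Google’s Chinese BERT vocabulary, a hidden size of 768, and roughly 102 million parameters for Chinese BERT-base. The 8,000-token and 7.5-million-parameter figures are the rounded values quoted above.

```python
# Assumed reference figures (not stated in the text above).
GOOGLE_ZH_VOCAB = 21_128   # entries in Google's Chinese BERT vocabulary
HIDDEN_SIZE = 768          # BERT-base embedding width
BERT_BASE_PARAMS = 102e6   # approximate size of Chinese BERT-base

# Rounded figures quoted in the text.
CLUE_VOCAB = 8_000
TINY_PARAMS = 7.5e6


def reduction(new, old):
    """Fractional reduction when going from `old` to `new`."""
    return 1 - new / old


vocab_cut = reduction(CLUE_VOCAB, GOOGLE_ZH_VOCAB)
param_cut = reduction(TINY_PARAMS, BERT_BASE_PARAMS)
# Embedding weights removed by the smaller vocabulary at BERT-base width.
saved = (GOOGLE_ZH_VOCAB - CLUE_VOCAB) * HIDDEN_SIZE

print(f"vocabulary reduction: {vocab_cut:.1%}")
print(f"embedding weights saved: {saved:,}")
print(f"tiny vs. BERT-base parameter reduction: {param_cut:.1%}")
```

Under these assumptions the vocabulary and parameter reductions land on the 62 percent and 92.6 percent quoted above, which suggests the page’s figures are measured against Chinese BERT-base.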

Key highlights

Large RoBERTa (290M params) scores 69.68% on the tested CLUE aggregate, nudging past RoBERTa-wwm-large’s 69.46%.
Tiny RoBERTa (7.5M params) trains roughly eight times faster than BERT-base and beats ALBERT-tiny on the same four-task suite.
Dedicated “pair” models for semantic similarity and sentence-pair tasks eke out small but consistent gains on AFQMC (about +0.41%).
All checkpoints keep the standard BERT architecture and can be loaded via Hugging Face Transformers BertTokenizer and BertModel.
The underlying CLUECorpus2020 spans 100 GB of raw Common-Crawl text, or about 35 billion Chinese characters.
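
Because the checkpoints load through the plain BERT classes, using one might look like the sketch below. The model identifier is an assumption for illustration only (check the repository README for the published names or download links), and the helpers load_checkpoint and encode_pair are hypothetical names, not part of the project. Imports are deferred into the functions so the snippet can be read and imported without transformers or torch installed.

```python
# Assumed identifier, not confirmed by this page; replace it with the
# name or local path given in the repository README.
ASSUMED_MODEL_ID = "clue/roberta_chinese_pair_tiny"


def load_checkpoint(model_id=ASSUMED_MODEL_ID):
    """Load the tokenizer and encoder with the standard BERT classes."""
    from transformers import BertModel, BertTokenizer  # deferred import

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def encode_pair(tokenizer, model, text_a, text_b):
    """Return the [CLS] hidden state for a sentence pair."""
    import torch  # deferred import

    # Passing two texts makes the tokenizer build a single
    # [CLS] a [SEP] b [SEP] input with matching segment ids.
    inputs = tokenizer(text_a, text_b, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]
```

Calling load_checkpoint() and then encode_pair(tokenizer, model, "句子一", "句子二") yields a single vector per pair. That vector is not a similarity score on its own; a task such as AFQMC would still need a classification head fine-tuned on top.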

Caveats

The project’s TODO list explicitly notes that the released large model still carries invalid or redundant parameters (无效参数 in the original) that have yet to be stripped.

Verdict

Grab these if you need off-the-shelf Chinese RoBERTa weights across the full size spectrum, from 7.5M-parameter tiny to 290M-parameter large. Skip it if you are looking for multilingual or English-only checkpoints; this corpus and vocab are strictly Chinese.

Frequently asked

What is CLUEbenchmark/CLUEPretrainedModels?

It trains Chinese RoBERTa models on 35 billion characters of Common Crawl and ships them in three variants: large, tiny (about 8× faster to train), and sentence-pair-optimized.