CLUEbenchmark/CLUEPretrainedModels
Collection of Chinese pre-trained language models, including BERT, ALBERT, and RoBERTa variants, with distillation for efficiency.

This repository provides a suite of Chinese pre-trained language models developed by CLUEbenchmark. It includes state-of-the-art large models, distilled small models that achieve an 8x speedup over BERT-base, and specialized semantic similarity models. The models are pre-trained on CLUECorpus2020, a 100GB Chinese corpus of 35 billion characters sourced from Common Crawl. They use a compact 8K vocabulary, which reduces computational cost while maintaining strong performance on Chinese NLP benchmarks.
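
To give a rough sense of why a smaller vocabulary lowers cost, the sketch below counts the weights in a token-embedding matrix for two vocabulary sizes. The figures are assumptions for illustration, not values taken from this repository: a hidden width of 768 (BERT-base), a baseline vocabulary of 21128 entries (the size used by Google's Chinese BERT-base), and "8K" rounded to 8000. The helper name `embedding_params` is hypothetical.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Count the weights in a token-embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768      # assumption: BERT-base hidden width
LARGE_VOCAB = 21128    # assumption: vocabulary size of Google's Chinese BERT-base
COMPACT_VOCAB = 8000   # assumption: "8K" rounded; the exact size is not stated here

large = embedding_params(LARGE_VOCAB, HIDDEN_SIZE)
compact = embedding_params(COMPACT_VOCAB, HIDDEN_SIZE)
saved = large - compact

print(f"embedding weights, large vocab:   {large:,}")
print(f"embedding weights, compact vocab: {compact:,}")
print(f"weights saved:                    {saved:,} ({saved / large:.0%})")
```

The same scaling applies to the output projection used for masked-token prediction during pre-training, so the saving matters most for the small distilled models, where the embedding table is a larger share of the total parameter count.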