Is Chinese-BERT-wwm open source?

Yes — ymcui/Chinese-BERT-wwm is open source, released under the Apache-2.0 license.

What language is Chinese-BERT-wwm written in?

ymcui/Chinese-BERT-wwm is primarily written in Python.

How popular is Chinese-BERT-wwm?

ymcui/Chinese-BERT-wwm has 10.2k stars on GitHub.

Where can I find Chinese-BERT-wwm?

ymcui/Chinese-BERT-wwm is on GitHub at https://github.com/ymcui/Chinese-BERT-wwm.

ymcui/Chinese-BERT-wwm

BERT for Chinese that respects actual word boundaries

The repository releases Chinese BERT checkpoints pre-trained with whole-word masking, forcing the model to predict complete segmented words rather than isolated characters.

★ 10.2k stars | Python | Language Models

What it does

The repository hosts a family of Chinese pre-trained Transformers (BERT-wwm, RoBERTa-wwm-ext, and several lightweight 3-to-6-layer variants) built on Google's original BERT architecture. Instead of masking individual WordPiece tokens (which, for Chinese, usually means single characters), the models use a whole-word masking strategy: if one character in a word is selected for masking, every character in that word is masked. The base checkpoints are trained on Chinese Wikipedia; the -ext checkpoints use a larger extended corpus totaling 5.4 billion words. All are distributed in TensorFlow and PyTorch formats via Hugging Face.
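To make the masking rule concrete, here is a minimal sketch in Python. It assumes the text has already been segmented into words, and it deliberately simplifies real BERT pre-training (no 80/10/10 replacement split, no fixed masking budget). The function name whole_word_mask and its parameters are illustrative, not taken from the repository.

```python
import random

MASK = "[MASK]"


def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask whole words: every character of a selected word becomes [MASK].

    `words` is a list of already-segmented words, e.g. ["使用", "语言", "模型"].
    Returns (tokens, labels): tokens are single characters or [MASK];
    labels hold the original character at masked positions, else None.
    """
    rng = rng or random.Random()
    tokens, labels = [], []
    for word in words:
        # One random draw per word (not per character), so a word is
        # either masked in full or left untouched.
        masked = rng.random() < mask_prob
        for ch in word:
            tokens.append(MASK if masked else ch)
            labels.append(ch if masked else None)
    return tokens, labels
```

Character-level masking would instead make one draw per character, so a two-character word such as 模型 could end up half masked; the per-word draw is the only difference.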

The interesting bit

The clever part is the combination of classical word segmentation for Chinese with modern masked-language-model pre-training. The authors use the LTP segmenter to identify word boundaries, then treat each segmented word as an atomic masking unit. This makes BERT, whose WordPiece tokenizer splits Chinese text into single characters, respect Chinese word structure without altering the underlying Transformer architecture.
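One way to turn segmenter output into masking units can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the segmented list is a hand-written stand-in for what a segmenter such as LTP might return, and both helper names are hypothetical. The "##" prefix mimics WordPiece continuation markers so that existing whole-word-masking code can group characters by word.

```python
def word_spans(words):
    """Return (start, end) character-index spans, end exclusive, one per word."""
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)
    return spans


def mark_continuations(words):
    """Split words into characters, prefixing non-initial characters with '##'.

    The result looks like WordPiece output, so masking code that groups
    '##' pieces with the preceding token treats each word as one unit.
    """
    pieces = []
    for word in words:
        for i, ch in enumerate(word):
            pieces.append(ch if i == 0 else "##" + ch)
    return pieces


# Hand-written stand-in for segmenter output (assumption, not real LTP output).
segmented = ["使用", "语言", "模型", "来", "预测"]
spans = word_spans(segmented)
pieces = mark_continuations(segmented)
```

Either representation carries the same information: which character positions belong together and must be masked as a group.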

Key highlights

Models include base, large, and lightweight 3-layer variants (RBT3, RBTL3) for resource-constrained deployment.
All checkpoints, including the RoBERTa-branded ones, load through the standard BertTokenizer and BertModel classes.
The accompanying paper is published in IEEE/ACM TASLP; the repository is maintained by the HFL lab (Harbin Institute of Technology & iFLYTEK).
Distributed via HuggingFace Hub under the hfl namespace and integrated with PaddleHub.
The extended-series models (-ext) train on a mixed corpus of encyclopedia, news, and QA text beyond Wikipedia.
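The loading path mentioned in the highlights can be sketched with the Hugging Face transformers library. The model ID below is an assumption based on the hfl namespace; check the Hub listing for the exact name of the checkpoint you want. The helper names are illustrative, and the imports are placed inside the functions so that defining them does not require transformers or torch to be installed.

```python
# Assumed Hub ID under the hfl namespace; verify it before relying on it.
DEFAULT_MODEL_ID = "hfl/chinese-roberta-wwm-ext"


def load_checkpoint(model_id=DEFAULT_MODEL_ID):
    """Load tokenizer and encoder with the BERT classes, even for RoBERTa-named IDs."""
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def cls_vector(text, tokenizer, model):
    """Return the final hidden state at the [CLS] position for one sentence."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]
```

Calling load_checkpoint() downloads the weights on first use; cls_vector("使用语言模型", tokenizer, model) then yields a tensor with one row and as many columns as the model's hidden size.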

Caveats

The released checkpoints do not include MLM head weights; if you need masked-language-modeling capabilities out of the box, you must perform secondary pre-training on your own data.
The repository is fundamentally a model zoo and release hub: there is no novel training framework here, just well-documented checkpoints and conversion instructions.
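If you want to confirm whether a checkpoint you downloaded carries MLM head parameters, a quick check on the parameter names is one option. The sketch below assumes the Hugging Face PyTorch naming convention, in which MLM head parameters start with "cls.predictions."; TensorFlow checkpoints use different names, and the helper has_mlm_head is hypothetical.

```python
# Assumption: Hugging Face PyTorch BERT checkpoints name MLM head
# parameters with this prefix (e.g. "cls.predictions.bias").
MLM_HEAD_PREFIX = "cls.predictions."


def has_mlm_head(param_names):
    """Return True if any parameter name belongs to the MLM prediction head."""
    return any(name.startswith(MLM_HEAD_PREFIX) for name in param_names)


# Example with hand-written names; with a real file you might pass
# torch.load(path, map_location="cpu").keys() instead.
encoder_only = ["bert.embeddings.word_embeddings.weight", "bert.pooler.dense.weight"]
with_head = encoder_only + ["cls.predictions.bias"]
print(has_mlm_head(encoder_only), has_mlm_head(with_head))
```

When the check returns False, loading the checkpoint into a masked-LM model class would leave the head randomly initialized, which is why further pre-training is needed before the predictions are meaningful.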

Verdict

Anyone building Chinese NLP pipelines who has outgrown the stock Google BERT-base checkpoint should grab these. Skip it if you are looking for a training library or need ready-to-use MLM weights without extra pre-training.
