Is MacBERT open source?

Yes — ymcui/MacBERT is open source, released under the Apache-2.0 license.

How popular is MacBERT?

ymcui/MacBERT has 717 stars on GitHub.

Where can I find MacBERT?

ymcui/MacBERT is on GitHub at https://github.com/ymcui/MacBERT.

ymcui/MacBERT

A Chinese BERT that ditches the [MASK] token

A drop-in Chinese BERT replacement that swaps [MASK] tokens for similar real words, closing the gap between pretraining and downstream tasks.

What it does

MacBERT is a Chinese pre-trained language model that keeps BERT's architecture but changes its masking strategy. Instead of hiding words behind the [MASK] token, a symbol that never appears in downstream tasks, it replaces the selected words with similar words drawn from a word2vec-based synonym toolkit. It combines this with Whole Word Masking and N-gram masking. The released base and large models slot directly into existing code through the standard BertTokenizer and BertModel classes.
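
Because the checkpoints reuse BERT's architecture, loading one should look like loading any other BERT model. The sketch below assumes the Hugging Face Hub ID hfl/chinese-macbert-base from the project README, plus installed transformers and torch packages. The helper names load_macbert and embed are illustrative, not part of any library.

```python
# Assumption: Hub ID taken from the project README; verify it before relying on it.
MODEL_ID = "hfl/chinese-macbert-base"


def load_macbert(model_id: str = MODEL_ID):
    """Load the tokenizer and encoder with the plain BERT classes."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Return the last-layer hidden states for a single sentence."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

Calling load_macbert() downloads the weights on first use. Passing the returned pair to embed("今天天气很好", tokenizer, model) should yield one hidden-state vector per token.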

The interesting bit

The fix is almost embarrassingly direct: BERT spends pretraining learning to fill in [MASK] positions, but [MASK] is a pretraining-only artifact that never shows up at fine-tuning time. MacBERT removes it from the input entirely, reframing the task as word correction rather than blank-filling. When the synonym lookup comes up empty, it falls back to a random word, so the [MASK] token is still never needed.
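
A toy sketch of the correction-style corruption described above, not the authors' pipeline. SIMILAR and VOCAB are hypothetical stand-ins for the synonym toolkit and the model vocabulary, and the simple word-level loop leaves out N-gram selection and any other details of the real procedure.

```python
import random

# Hypothetical stand-in for the synonym toolkit: a tiny hand-written table.
SIMILAR = {
    "喜欢": ["喜爱", "爱好"],
    "天气": ["气候"],
}
# Hypothetical fallback vocabulary for words with no similar-word entry.
VOCAB = ["苹果", "跑步", "音乐", "城市"]


def corrupt(words, mask_prob=0.15, rng=None):
    """Corrupt a segmented sentence without ever emitting [MASK].

    Returns (corrupted, targets). targets[i] is the original word when
    position i was selected for corruption, otherwise None.
    """
    if rng is None:
        rng = random.Random()
    corrupted, targets = [], []
    for word in words:
        if rng.random() >= mask_prob:
            # Position not selected: keep the word, nothing to predict.
            corrupted.append(word)
            targets.append(None)
            continue
        candidates = SIMILAR.get(word)
        if candidates:
            replacement = rng.choice(candidates)  # similar-word replacement
        else:
            replacement = rng.choice(VOCAB)  # fallback: random word
        corrupted.append(replacement)
        targets.append(word)  # the model must recover the original word
    return corrupted, targets
```

The training target at each corrupted position is the original word, which is what makes the objective look like correction: the model sees a plausible but wrong word and has to restore the right one.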

Key highlights

Drop-in replacement: identical architecture to BERT, loadable via Hugging Face Transformers using the standard BertModel and BertTokenizer classes
Two model sizes: base (102M parameters) and large (324M parameters)
Evaluated on six Chinese NLP benchmarks—CMRC 2018, DRCD, XNLI, ChnSentiCorp, LCQMC, and BQ Corpus—where it generally outperforms BERT and BERT-wwm baselines
Published in Findings of EMNLP 2020 by researchers at Harbin Institute of Technology and iFLYTEK
Combines the masking fix with Whole Word Masking and N-gram masking
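
As a rough sanity check on the quoted model sizes, the parameter count of a BERT-style encoder can be worked out from its dimensions. The sketch assumes the usual BERT-base and BERT-large dimensions and a vocabulary of 21,128 tokens (the size used by Google's Chinese BERT, not stated in this summary). It counts only embeddings and encoder layers, leaving out the pooler and any task heads.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count embedding and encoder parameters of a BERT-style model."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights + biases), plus LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    return embeddings + layers * (attention + feed_forward)


VOCAB_SIZE = 21128  # assumption: vocabulary size of Google's Chinese BERT

base = bert_param_count(VOCAB_SIZE, hidden=768, layers=12, intermediate=3072)
large = bert_param_count(VOCAB_SIZE, hidden=1024, layers=24, intermediate=4096)
print(f"base:  {base / 1e6:.1f}M")
print(f"large: {large / 1e6:.1f}M")
```

Under those assumptions the script prints about 101.7M and 324.5M, in line with the 102M and 324M figures above.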

Caveats

Gains are not universal: on the LCQMC sentence-pair matching task, ELECTRA-base scores higher than MacBERT-base with the same parameter budget
The synonym substitution depends on an external word2vec toolkit, and the README does not quantify how often the random-word fallback is triggered

Verdict

Chinese NLP practitioners looking for a painless BERT upgrade should swap in MacBERT. Teams already running ELECTRA or RoBERTa-wwm-ext may find the improvements incremental.
