Chinese NER: when BERT meets CRF and nobody gets hurt
A straightforward PyTorch baseline for CLUENER2020 that stacks BiLSTM, BERT, and RoBERTa with optional CRF layers to see what actually moves the needle on fine-grained Chinese entity recognition.

What it does
This repo implements four baseline architectures for the CLUENER2020 Chinese NER task: vanilla BiLSTM-CRF, BERT with softmax, BERT-CRF, and BERT-BiLSTM-CRF. Swap in RoBERTa-wwm-ext-large for BERT and you get the RoBERTa variants. It is essentially a clean, runnable reference implementation with a results table.
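To make the softmax-versus-CRF distinction concrete, here is a minimal pure-Python sketch of Viterbi decoding, the step a CRF layer uses at inference time in place of a per-token argmax. The tag set, scores, and the `viterbi_decode` name are illustrative assumptions rather than code from the repo, and start and end transitions are left out for brevity.

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag path.

    emissions[t][j]   : score of tag j at position t (e.g. from BERT or a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j (learned by the CRF)
    """
    num_tags = len(transitions)
    # score[j] = best score of any path that ends in tag j at the current position
    score = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + step[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical three-tag example: O, B-name, I-name
TAGS = ["O", "B-name", "I-name"]
emissions = [
    [2.0, 0.5, 0.1],   # token 0: clearly O
    [0.1, 1.0, 1.2],   # token 1: I-name narrowly beats B-name
    [0.1, 0.2, 1.5],   # token 2: clearly I-name
]
transitions = [
    [0.0, 0.0, -1e4],  # from O: jumping straight to I-name is heavily penalized
    [0.0, 0.0, 0.0],   # from B-name
    [0.0, 0.0, 0.0],   # from I-name
]

greedy = [max(range(len(TAGS)), key=lambda j: row[j]) for row in emissions]
decoded = viterbi_decode(emissions, transitions)
print([TAGS[i] for i in greedy])   # per-token argmax, as in a softmax head
print([TAGS[i] for i in decoded])  # CRF-style decoding
```

In this toy case the per-token argmax emits an I-name directly after O, which is not a valid span, while the Viterbi path starts the entity with B-name. That structural constraint is the main thing a CRF layer can add on top of a strong encoder.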
The interesting bit
The README is admirably honest about its own limitations. The author notes the dataset has quality issues, admits to using the validation set as a test set because the real test set is locked behind a limited-submission leaderboard, and even flags that you must manually move train.log before re-running or it gets overwritten. This is baseline code that knows it is baseline code.
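The log-overwrite chore is easy to script. Below is a small sketch that moves an existing train.log into a timestamped archive before a new run. The `archive_log` helper, the `logs` directory, and the assumption that train.log sits in the working directory are all hypothetical, not part of the repo.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

def archive_log(log_path: str = "train.log",
                archive_dir: str = "logs") -> Optional[Path]:
    """Move an existing log file into archive_dir under a timestamped name.

    Returns the new path, or None if there was nothing to archive.
    """
    src = Path(log_path)
    if not src.exists():
        return None
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    # Avoid clobbering an earlier archive created within the same second
    counter = 1
    while dest.exists():
        dest = dest_dir / f"{src.stem}-{stamp}-{counter}{src.suffix}"
        counter += 1
    shutil.move(str(src), str(dest))
    return dest
```

Calling `archive_log()` at the top of a training script, before the logger opens its file, would keep earlier runs around without the manual step.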
Key highlights
- Four model variants with clear F1 score breakdowns across 10 entity types (address, book, company, game, government, movie, name, organization, position, scene)
- RoBERTa-wwm-ext-large + BiLSTM + CRF edges out pure RoBERTa-CRF overall (79.64 vs 79.34 F1), though the gap is narrow and category-dependent
- Requires manual BERT/RoBERTa model download and TensorFlow-to-PyTorch conversion, with a Baidu Netdisk link provided for the impatient
- Built on transformers==2.2.2 and PyTorch 1.5.1, versions that feel increasingly archaeological
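For readers wondering how a per-entity-type F1 breakdown like the one in the first highlight is typically computed, here is a minimal sketch of exact-match, span-level scoring. It assumes BIO-style tags and uses made-up sequences; the repo's own tagging scheme and metric code may differ, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

def extract_spans(tags: Iterable[str]) -> Set[Tuple[str, int, int]]:
    """Collect (entity_type, start, end) spans from one BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if ent_type is not None and not continues:
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        # A stray I- tag with no open span of the same type is ignored
    return spans

def f1_by_type(gold_seqs: List[List[str]],
               pred_seqs: List[List[str]]) -> Dict[str, float]:
    """Exact-match F1 per entity type: a span counts only if type and bounds match."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        for ent_type, _, _ in g:
            n_gold[ent_type] += 1
        for ent_type, _, _ in p:
            n_pred[ent_type] += 1
        for ent_type, _, _ in g & p:
            tp[ent_type] += 1
    scores = {}
    for ent_type in set(n_gold) | set(n_pred):
        precision = tp[ent_type] / n_pred[ent_type] if n_pred[ent_type] else 0.0
        recall = tp[ent_type] / n_gold[ent_type] if n_gold[ent_type] else 0.0
        denom = precision + recall
        scores[ent_type] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Made-up example: the company span is predicted one token short
gold = [["B-name", "I-name", "O", "B-company", "I-company"]]
pred = [["B-name", "I-name", "O", "B-company", "O"]]
scores = f1_by_type(gold, pred)
print(scores)
```

Because matching is exact, the truncated company span earns no credit at all, which is one reason per-category scores can swing noticeably between otherwise similar models.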
Caveats
- transformers==2.2.2 is pinned; upgrading likely breaks things
- The “test set” is really the validation set, so numbers are not directly comparable to official leaderboard submissions
- Data quality issues in CLUENER2020 are acknowledged but not mitigated in code
Verdict
Grab this if you need a working Chinese NER starter in PyTorch and want to see how much CRF and BiLSTM stacking actually help on top of a large pretrained model. Skip it if you need production-ready code, modern dependency versions, or rigorous evaluation against the true test set.