When BERT isn't enough: pre-training specifically to read
A Chinese MRC competition repo that shows domain-specific continued pre-training can squeeze extra points from familiar models.

What it does
This repo releases code and models from top-finishing entries in Chinese machine reading comprehension competitions, chiefly DuReader. The main deliverable is a set of BERT-family checkpoints (RoBERTa-wwm-large, MacBERT-large) that have been further pre-trained on a large corpus of Chinese MRC data spanning medical, legal, military, and general encyclopedic domains.
The interesting bit
The author didn’t just fine-tune; they collected and cleaned web-scale Chinese MRC data, re-labeled missing answer spans with fuzzy F1 matching, and engineered negative samples three different ways (random retrieval, answer-deletion, BM25 hard negatives). The resulting models reportedly gain ~2 F1 points on MRC and ~1 point on classification versus off-the-shelf pre-trained weights. It’s a cookbook for “what if we just pre-train more, but MRC-shaped.”
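To make the re-labeling step concrete, here is a minimal sketch of fuzzy F1 span matching, assuming character-level F1 (a common choice for Chinese) and an exhaustive search over spans close to the answer's length. The names `char_f1` and `relabel_span`, the 0.8 threshold, and the `slack` parameter are all illustrative assumptions; the repo's actual matching logic may differ.

```python
from collections import Counter


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 based on bag-of-characters overlap."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def relabel_span(context: str, answer: str, min_f1: float = 0.8, slack: int = 2):
    """Find the context span that best matches `answer`.

    Returns (start, end, f1) with `end` exclusive, or None when no span
    reaches `min_f1`. Threshold and slack are hypothetical defaults.
    """
    best = None
    min_len = max(1, len(answer) - slack)
    max_len = len(answer) + slack
    for start in range(len(context)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(context):
                break
            score = char_f1(context[start:end], answer)
            if best is None or score > best[2]:
                best = (start, end, score)
    if best is None or best[2] < min_f1:
        return None
    return best


# The answer text is missing one character relative to the context.
context = "北京是中华人民共和国的首都。"
span = relabel_span(context, "中华人民共和国首都")
if span is not None:
    start, end, score = span
    print(context[start:end], round(score, 3))
```

Examples whose best span falls below the threshold come back as `None`, so a pipeline along these lines could drop them or treat them as no-answer cases rather than train on a poor label.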
Key highlights
- Two re-trained large models on Hugging Face: chinese_pretrain_mrc_roberta_wwm_ext_large and a MacBERT variant
- Data pipeline includes aggressive filtering (context >1024 dropped, HTML-tag ratio capped at 30%) and sliding-window chunking for long documents
- Negative sample construction is deliberately varied: 50% random, 20% answer-stripped, 30% BM25-retrieved hard negatives
- One-shell-script training via train_bert.sh; supports SQuAD 2.0-style no-answer data with a single flag
- Claims multiple third-party top-5 competition finishes on DuReader, legal, and medical leaderboards
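The filtering, chunking, and negative-mix highlights above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's code: the 1024 limit is assumed to count characters, the tag ratio is assumed to be tag characters over total characters, the window and stride sizes are placeholders, and every function name here is hypothetical. Only the 1024 and 30% thresholds and the 50/20/30 split come from the summary.

```python
import random
import re

MAX_CONTEXT_LEN = 1024   # unit assumed to be characters; the summary does not say
MAX_TAG_RATIO = 0.30     # assumed: share of characters that sit inside HTML tags
TAG_RE = re.compile(r"<[^>]+>")

# Mix quoted in the highlights: 50% random, 20% answer-stripped, 30% BM25 hard.
NEGATIVE_MIX = {"random": 0.5, "answer_stripped": 0.2, "bm25_hard": 0.3}


def html_tag_ratio(text):
    """Fraction of the text's characters that belong to HTML tags."""
    if not text:
        return 0.0
    tag_chars = sum(len(m.group(0)) for m in TAG_RE.finditer(text))
    return tag_chars / len(text)


def keep_context(text):
    """Apply both filters: length cap and HTML-tag ratio cap."""
    return len(text) <= MAX_CONTEXT_LEN and html_tag_ratio(text) <= MAX_TAG_RATIO


def sliding_windows(tokens, window=384, stride=128):
    """Split a token list into overlapping (offset, chunk) pairs.

    `window` and `stride` are placeholder sizes, not values from the repo.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


def draw_negative_kind(rng):
    """Pick which negative-construction strategy to use for one example."""
    kinds = list(NEGATIVE_MIX)
    weights = [NEGATIVE_MIX[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]


rng = random.Random(0)
for offset, chunk in sliding_windows(list("abcdefghij"), window=4, stride=3):
    print(offset, "".join(chunk), draw_negative_kind(rng))
```

Keeping the offset with each chunk matters for MRC: an answer span labeled against the full document has to be shifted by that offset, and windows that do not contain the span become natural no-answer examples.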
Caveats
- Pinned to transformers==2.10.0 for training; the README admits the code was migrated from an earlier custom implementation and some “details” were lost in translation
- Data sources are vaguely described as “collected from the web” plus self-crawled pages; reproducibility of the corpus is unclear
- The README’s benchmark table has identical baseline scores for two different base models, which looks like a copy-paste error
Verdict
Worth a look if you’re competing on Chinese MRC or wondering whether task-shaped continued pre-training is worth the effort. Skip if you need rigorous reproducibility or are working primarily in English; the value is in the recipe, not the infrastructure.