luhua-rain/MRC_Competition_Dureader

When BERT isn't enough: pre-training specifically to read

A Chinese MRC competition repo that shows domain-specific continued pre-training can squeeze extra points from familiar models.

743 stars · Python · Language Models · Data Tooling
Velocity · 7d: +0.3 ★ / day · Trend: steady

What it does

This repo releases code and models from top-finishing entries in Chinese machine reading comprehension competitions, chiefly DuReader. The main deliverable is a set of BERT-family checkpoints (RoBERTa-wwm-large, MacBERT-large) that have been further pre-trained on a large corpus of Chinese MRC data spanning medical, legal, military, and general encyclopedic domains.

The interesting bit

The author didn’t just fine-tune; they collected and cleaned web-scale Chinese MRC data, re-labeled missing answer spans with fuzzy F1 matching, and engineered negative samples three different ways (random retrieval, answer-deletion, BM25 hard negatives). The resulting models reportedly gain ~2 F1 points on MRC and ~1 point on classification versus off-the-shelf pre-trained weights. It’s a cookbook for “what if we just pre-train more, but MRC-shaped.”
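The re-labeling step can be sketched roughly as follows. This is a minimal illustration and not the repo's code: it assumes character-level F1 (a common choice for Chinese text), and the names `char_f1` and `relabel_span`, the `threshold` of 0.8, and the `slack` window are all hypothetical.

```python
from collections import Counter


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 between two strings (bag-of-characters overlap)."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def relabel_span(context: str, answer: str, threshold: float = 0.8, slack: int = 2):
    """Find the context span that best matches `answer` by character F1.

    Returns (start, end, score) with `end` exclusive, or None when no span
    reaches `threshold`. `threshold` and `slack` are illustrative values.
    """
    best_score, best_start, best_end = 0.0, -1, -1
    n = len(answer)
    # Try windows whose length is within `slack` characters of the answer length.
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(0, len(context) - length + 1):
            score = char_f1(context[start:start + length], answer)
            if score > best_score:
                best_score, best_start, best_end = score, start, start + length
    if best_score < threshold:
        return None
    return best_start, best_end, best_score
```

A brute-force scan like this is quadratic in the context length, which is tolerable for a one-off cleaning pass; an answer with a small typo still lands on the right span, while an answer that never appears in the context returns `None` and can be treated as unanswerable.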

Key highlights

  • Two re-trained large models on Hugging Face: chinese_pretrain_mrc_roberta_wwm_ext_large and a MacBERT variant
  • Data pipeline includes aggressive filtering (contexts longer than 1024 dropped, HTML-tag ratio capped at 30%) and sliding-window chunking for long documents
  • Negative sample construction is deliberately varied: 50% random, 20% answer-stripped, 30% BM25-retrieved hard negatives
  • Training runs from a single shell script, train_bert.sh; SQuAD 2.0-style no-answer data is supported with a single flag
  • Claims multiple third-party top-5 competition finishes on DuReader, legal, and medical leaderboards
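The filtering, chunking, and negative-sample mix from the highlights above could look something like the sketch below. It is an illustration under stated assumptions, not the repo's implementation: the 1024 limit is treated as a character count, the tag ratio is measured as tag characters over total characters, and the function names, `window`, and `stride` values are hypothetical. Only the thresholds (1024, 30%) and the 50/20/30 split come from the description.

```python
import random
import re

MAX_CONTEXT_LEN = 1024   # from the description; unit assumed to be characters
MAX_TAG_RATIO = 0.30     # from the description; measured here as tag chars / total chars
TAG_RE = re.compile(r"<[^>]+>")

# Negative-sample mix from the description: 50% random, 20% answer-stripped, 30% BM25.
NEGATIVE_MIX = {"random": 0.5, "answer_stripped": 0.2, "bm25_hard": 0.3}


def keep_context(text: str) -> bool:
    """Return True if the context passes the length and HTML-tag filters."""
    if not text or len(text) > MAX_CONTEXT_LEN:
        return False
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(text))
    return tag_chars / len(text) <= MAX_TAG_RATIO


def sliding_windows(tokens, window=384, stride=128):
    """Split a token list into overlapping chunks that cover every token.

    `window` and `stride` are illustrative defaults, not values from the repo.
    """
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


def pick_negative_strategy(rng: random.Random) -> str:
    """Sample which negative-construction strategy to use for one example."""
    names = list(NEGATIVE_MIX)
    weights = [NEGATIVE_MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Sampling the strategy per example, rather than splitting the dataset up front, keeps the mix close to the target ratios at any corpus size and makes the proportions a one-line change.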

Caveats

  • Pinned to transformers==2.10.0 for training; the README admits the code was migrated from an earlier custom implementation and some “details” were lost in translation
  • Data sources are vaguely described as “collected from the web” plus self-crawled pages; reproducibility of the corpus is unclear
  • The README’s benchmark table has identical baseline scores for two different base models, which looks like a copy-paste error

Verdict

Worth a look if you’re competing on Chinese MRC or wondering whether task-shaped continued pre-training is worth the effort. Skip if you need rigorous reproducibility or are working primarily in English; the value is in the recipe, not the infrastructure.
