Is MRC_Competition_Dureader open source?

Yes — luhua-rain/MRC_Competition_Dureader is an open-source project tracked on heatdrop.

What language is MRC_Competition_Dureader written in?

luhua-rain/MRC_Competition_Dureader is primarily written in Python.

How popular is MRC_Competition_Dureader?

luhua-rain/MRC_Competition_Dureader has 743 stars on GitHub.

Where can I find MRC_Competition_Dureader?

luhua-rain/MRC_Competition_Dureader is on GitHub at https://github.com/luhua-rain/MRC_Competition_Dureader.

luhua-rain/MRC_Competition_Dureader

When BERT isn't enough: pre-training specifically to read

A Chinese MRC competition repo that shows domain-specific continued pre-training can squeeze extra points from familiar models.

★ 743 stars | Python | Language Models | Data Tooling

What it does

This repo releases code and models from top-finishing entries in Chinese machine reading comprehension competitions, chiefly DuReader. The main deliverable is a set of BERT-family checkpoints (RoBERTa-wwm-large, MacBERT-large) that have been further pre-trained on a large corpus of Chinese MRC data spanning medical, legal, military, and general encyclopedic domains.

The interesting bit

The author didn't just fine-tune. They collected and cleaned a large, web-sourced corpus of Chinese MRC data, re-labeled missing answer spans with fuzzy F1 matching, and built negative samples three different ways (random retrieval, answer deletion, BM25 hard negatives). The resulting models reportedly gain about 2 F1 points on MRC tasks and about 1 point on classification tasks over off-the-shelf pre-trained weights. It's a cookbook for "what if we just pre-train more, but MRC-shaped."
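To make "fuzzy F1 matching" concrete, here is a minimal sketch of how a missing answer span might be recovered from a context. It is not the repo's code: the character-level F1, the `min_f1` threshold, the `slack` parameter, and the function names `char_f1` and `relabel_span` are all assumptions made for illustration.

```python
from collections import Counter


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 between two strings (bag-of-characters overlap)."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def relabel_span(context: str, answer: str, min_f1: float = 0.8, slack: int = 2):
    """Return (start, end, f1) for the context span closest to `answer`.

    `end` is exclusive. Returns None when no candidate reaches `min_f1`.
    The threshold and the length slack are hypothetical defaults.
    """
    # An exact substring match needs no fuzzy search.
    exact = context.find(answer) if answer else -1
    if exact != -1:
        return exact, exact + len(answer), 1.0

    best = None
    shortest = max(1, len(answer) - slack)
    longest = len(answer) + slack
    # Brute-force scan over candidate spans whose length is near the answer's.
    for length in range(shortest, longest + 1):
        for start in range(0, len(context) - length + 1):
            score = char_f1(context[start:start + length], answer)
            if score >= min_f1 and (best is None or score > best[2]):
                best = (start, start + length, score)
    return best


context = "北京是中华人民共和国的首都。"
span = relabel_span(context, "中华人民共和国首都")  # answer text lacks "的"
if span is not None:
    start, end, score = span
    print(context[start:end], round(score, 3))
```

The brute-force scan is quadratic in the worst case, which is tolerable for a sketch but would likely need pruning on a large corpus.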

Key highlights

Two re-trained large models on Hugging Face: chinese_pretrain_mrc_roberta_wwm_ext_large and a MacBERT variant
Data pipeline applies aggressive filtering (contexts longer than 1024 are dropped, HTML-tag ratio is capped at 30%) and sliding-window chunking for long documents
Negative sample construction is deliberately varied: 50% random, 20% answer-stripped, 30% BM25-retrieved hard negatives
One-shell-script training via train_bert.sh; supports SQuAD 2.0-style no-answer data with a single flag
Claims multiple third-party top-5 competition finishes on DuReader, legal, and medical leaderboards
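
The filtering, chunking, and negative-sample mix in the highlights above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's pipeline: the 1024 limit is treated as a character count, the tag ratio is measured as tag characters over total characters, the window and stride sizes are hypothetical, and every function name here is made up. BM25 retrieval itself is left out.

```python
import random
import re

MAX_CONTEXT_LEN = 1024  # limit from the summary; counting characters is an assumption
MAX_TAG_RATIO = 0.30    # cap from the summary; how the ratio is measured is an assumption
TAG_RE = re.compile(r"<[^>]+>")

# Negative-sample mix from the summary: 50% random, 20% answer-stripped, 30% BM25.
NEGATIVE_MIX = [("random", 0.5), ("answer_deleted", 0.2), ("bm25_hard", 0.3)]


def keep_context(context: str) -> bool:
    """Apply both filters: the length cap and the HTML-tag ratio cap."""
    if not context or len(context) > MAX_CONTEXT_LEN:
        return False
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(context))
    return tag_chars / len(context) <= MAX_TAG_RATIO


def sliding_windows(text: str, window: int = 512, stride: int = 128):
    """Split text into overlapping (offset, chunk) pairs; sizes are hypothetical."""
    if len(text) <= window:
        return [(0, text)]
    chunks = []
    start = 0
    while True:
        chunks.append((start, text[start:start + window]))
        if start + window >= len(text):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


def pick_negative_kind(rng: random.Random) -> str:
    """Draw which kind of negative sample to build, following NEGATIVE_MIX."""
    kinds, weights = zip(*NEGATIVE_MIX)
    return rng.choices(kinds, weights=weights, k=1)[0]


def delete_answer(context: str, answer: str) -> str:
    """Build an answer-stripped negative by removing the answer text."""
    return context.replace(answer, "")


rng = random.Random(0)
print(keep_context("<div><p></p></div>x"))
print([offset for offset, _ in sliding_windows("0123456789", window=4, stride=3)])
print(pick_negative_kind(rng))
```

Drawing the negative kind per example, rather than splitting the dataset up front, keeps the mix approximately right at any dataset size, which is one reasonable way to read the stated percentages.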

Caveats

Pinned to transformers==2.10.0 for training; the README admits the code was migrated from an earlier custom implementation and some “details” were lost in translation
Data sources are vaguely described as “collected from the web” plus self-crawled pages; reproducibility of the corpus is unclear
The README’s benchmark table has identical baseline scores for two different base models, which looks like a copy-paste error

Verdict

Worth a look if you’re competing on Chinese MRC or wondering whether task-shaped continued pre-training is worth the effort. Skip if you need rigorous reproducibility or are working primarily in English; the value is in the recipe, not the infrastructure.
