Keep pretraining your BERT, but actually do it right
Research code showing that more pretraining on your actual domain beats generic off-the-shelf models—sometimes dramatically.

What it does
This is the reproducibility code for the ACL 2020 paper “Don’t Stop Pretraining.” It trains RoBERTa classifiers on niche tasks (biomedical papers, CS citation intent, movie reviews) using three strategies: plain pretrained models, domain-adaptive pretraining (DAPT, more pretraining on broad domain text), and task-adaptive pretraining (TAPT, more pretraining on the unlabeled text of the task corpus itself). The repo includes all training scripts, configs, and links to download the authors' resulting models.
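To make the TAPT idea concrete, here is a minimal sketch of how a task's labeled data could be turned into an unlabeled corpus for continued pretraining. It is not the repo's own preprocessing: the helper name build_tapt_corpus, the JSON-lines layout, and the "text" and "label" field names are all assumptions made for illustration, and the two sample records are made up.

```python
import json
from typing import Iterable, List


def build_tapt_corpus(labeled_jsonl_lines: Iterable[str],
                      text_field: str = "text") -> List[str]:
    """Drop the labels and keep one cleaned text per entry.

    Hypothetical helper: assumes each input line is a JSON object
    holding the raw text under `text_field`.
    """
    corpus = []
    for line in labeled_jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Collapse internal whitespace so each document fits on one line,
        # a common input layout for masked-language-model pretraining.
        text = " ".join(str(record[text_field]).split())
        if text:
            corpus.append(text)
    return corpus


# Made-up labeled examples standing in for a task's training split.
sample = [
    '{"text": "A slow start, but the ending lands.", "label": "positive"}',
    '{"text": "Two hours  I will\\n never get back.", "label": "negative"}',
]

tapt_corpus = build_tapt_corpus(sample)
for doc in tapt_corpus:
    print(doc)
```

The labels are discarded on purpose: TAPT only continues the language-modeling objective on the task's text, and the labels come back later for the classification fine-tuning step.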
The interesting bit
The paper’s core finding is that TAPT—just pretraining longer on your unlabeled task data—often beats DAPT, and combining both can beat generic RoBERTa by meaningful margins. The code lets you test this yourself rather than trust a bar chart.
Key highlights
- Pretrained models hosted on Hugging Face: four DAPT variants (CS, biomedical, reviews, news) and ~20 TAPT combinations
- Automated dataset downloading from S3 via scripts/train.py
- Hyperparameter search integration via allentune
- Branch latest-allennlp for modern transformers usage; main branch pinned to 2020-era pytorch-transformers==1.2.0 for exact reproducibility
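As a rough illustration of using the hosted DAPT checkpoints, the sketch below maps a domain name to a Hugging Face model ID and loads it with the transformers Auto classes. The four IDs are written from memory of the project's README and should be treated as assumptions to verify on the Hub; resolve_dapt_model and load_dapt_model are hypothetical helper names, not part of the repo.

```python
# Hub IDs below are assumptions recalled from the project README;
# confirm them on the Hugging Face Hub before relying on them.
DAPT_MODELS = {
    "cs": "allenai/cs_roberta_base",
    "biomed": "allenai/biomed_roberta_base",
    "reviews": "allenai/reviews_roberta_base",
    "news": "allenai/news_roberta_base",
}


def resolve_dapt_model(domain: str) -> str:
    """Return the (assumed) Hub ID for a domain, case-insensitively."""
    key = domain.strip().lower()
    if key not in DAPT_MODELS:
        raise KeyError(
            f"unknown domain {domain!r}; expected one of {sorted(DAPT_MODELS)}"
        )
    return DAPT_MODELS[key]


def load_dapt_model(domain: str):
    """Download and return (tokenizer, model) for the given domain.

    transformers is imported lazily so the lookup above works even
    when the library is not installed.
    """
    from transformers import AutoModel, AutoTokenizer

    model_id = resolve_dapt_model(domain)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


print(resolve_dapt_model("CS"))

# Uncomment to fetch weights (needs network access and transformers):
# tokenizer, model = load_dapt_model("biomed")
```

This loads the bare encoder only. For the classification experiments the review describes, you would still need the repo's training scripts or your own fine-tuning loop on top.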
Caveats
- Main branch uses a crusty pinned AllenNLP that requires manual model downloads; the modern branch has explicitly not been tested across all models
- No performance numbers or comparisons in the README itself—you’ll need the paper for the actual results
- Curated TAPT models (larger dataset sizes) only exist for three of the nine tasks
Verdict
Worth a look if you’re building NLP pipelines in specialized domains and suspect your generic BERT is leaving points on the table. Skip if you need a plug-and-play library; this is research scaffolding, not a product.