allenai/dont-stop-pretraining

Keep pretraining your BERT, but actually do it right

Research code showing that more pretraining on your actual domain beats generic off-the-shelf models—sometimes dramatically.

543 stars · Python · Language Models · Data Tooling
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

This is the reproducibility code for the ACL 2020 paper “Don’t Stop Pretraining.” It trains RoBERTa classifiers on specialized tasks (biomedical papers, CS citation intent, movie reviews) using three strategies: the plain pretrained model, domain-adaptive pretraining (DAPT, continued pretraining on broad unlabeled domain text), and task-adaptive pretraining (TAPT, continued pretraining on the task’s own unlabeled text). The repo includes the training scripts, configs, and links to download the authors’ resulting models.
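Both DAPT and TAPT are "more pretraining," so neither needs labels: the text supplies its own targets through masked language modeling. Below is a minimal pure-Python sketch of the standard BERT/RoBERTa-style masking rule (select about 15% of tokens; of those, 80% become a mask token, 10% a random token, 10% stay unchanged). The function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are illustrative assumptions, not code from the repo.

```python
import random

MASK = "<mask>"  # RoBERTa's mask token string


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one masked-LM training example.

    labels[i] holds the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with mask
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position not selected for prediction
            inputs.append(tok)
    return inputs, labels


# Hypothetical unlabeled task sentence, split on whitespace for illustration.
sentence = "the protein binds to the receptor".split()
toy_vocab = sorted(set(sentence))
inputs, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
```

Under this sketch, the only difference between DAPT and TAPT is which unlabeled corpus the sentences are drawn from.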

The interesting bit

The paper’s core finding is that TAPT—just pretraining longer on your unlabeled task data—often beats DAPT, and combining both can beat generic RoBERTa by meaningful margins. The code lets you test this yourself rather than trust a bar chart.
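One way to picture the comparison is as four training schedules that all end with the same supervised fine-tuning step and differ only in which extra pretraining phases come first. The sketch below is illustrative: the `Phase` class, the `SCHEDULES` table, and the strategy keys are hypothetical names, not identifiers from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str       # short label for the phase
    objective: str  # what the model is trained to do
    corpus: str     # what data the phase reads


DAPT = Phase("DAPT", "masked LM", "broad unlabeled domain text")
TAPT = Phase("TAPT", "masked LM", "unlabeled task text")
FINE_TUNE = Phase("fine-tune", "classification", "labeled task data")

# Every schedule starts from the released RoBERTa checkpoint.
SCHEDULES = {
    "baseline": [FINE_TUNE],
    "dapt": [DAPT, FINE_TUNE],
    "tapt": [TAPT, FINE_TUNE],
    "dapt+tapt": [DAPT, TAPT, FINE_TUNE],
}


def describe(strategy):
    """Render a schedule as an arrow-separated list of phase names."""
    return " -> ".join(phase.name for phase in SCHEDULES[strategy])


print(describe("dapt+tapt"))  # DAPT -> TAPT -> fine-tune
```

Laying the schedules side by side makes the controlled variable explicit: the labeled data and the classifier are the same in every run, so any difference in score comes from the unlabeled pretraining phases.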

Key highlights

  • Pretrained models hosted on Hugging Face: four DAPT variants (CS, biomedical, reviews, news) and ~20 TAPT combinations
  • Automated dataset downloading from S3 via scripts/train.py
  • Hyperparameter search integration via allentune
  • A latest-allennlp branch for use with the modern transformers library; the main branch stays pinned to the older pytorch-transformers==1.2.0 for exact reproducibility
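
For readers who only want the released encoders, the hosted DAPT checkpoints can in principle be loaded directly with the transformers library, without the repo's training scripts. This is a minimal sketch: the four model IDs are assumptions based on the allenai naming on the Hugging Face hub and should be checked against the repo's README, and `load_dapt_encoder` is a hypothetical helper, not part of the repo.

```python
# Assumed Hugging Face model IDs for the four DAPT variants; verify against
# the repo's README before relying on them.
DAPT_MODELS = {
    "cs": "allenai/cs_roberta_base",
    "biomed": "allenai/biomed_roberta_base",
    "reviews": "allenai/reviews_roberta_base",
    "news": "allenai/news_roberta_base",
}


def load_dapt_encoder(domain):
    """Download and return (tokenizer, model) for one DAPT domain.

    Requires the `transformers` package and network access on first use.
    """
    if domain not in DAPT_MODELS:
        raise ValueError(
            f"unknown domain {domain!r}; expected one of {sorted(DAPT_MODELS)}"
        )
    # Imported lazily so the lookup table is usable without transformers.
    from transformers import AutoModel, AutoTokenizer

    model_id = DAPT_MODELS[domain]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_dapt_encoder("biomed")` would fetch the biomedical encoder; a task-specific classification head still has to be trained on top of it.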

Caveats

  • The main branch uses a crusty pinned AllenNLP that requires manual model downloads; the latest-allennlp branch is explicitly marked as not yet tested with every model
  • No performance numbers or comparisons in the README itself—you’ll need the paper for the actual results
  • Curated TAPT models (larger dataset sizes) only exist for three of the nine tasks

Verdict

Worth a look if you’re building NLP pipelines in specialized domains and suspect your generic BERT is leaving points on the table. Skip if you need a plug-and-play library; this is research scaffolding, not a product.
