Is dont-stop-pretraining open source?

Yes — allenai/dont-stop-pretraining is an open-source project tracked on heatdrop.

What language is dont-stop-pretraining written in?

allenai/dont-stop-pretraining is primarily written in Python.

How popular is dont-stop-pretraining?

allenai/dont-stop-pretraining has 543 stars on GitHub.

Where can I find dont-stop-pretraining?

allenai/dont-stop-pretraining is on GitHub at https://github.com/allenai/dont-stop-pretraining.

allenai/dont-stop-pretraining

Keep pretraining your BERT, but actually do it right

Research code showing that more pretraining on your actual domain beats generic off-the-shelf models—sometimes dramatically.

★ 543 stars · Python · Language Models · Data Tooling

What it does

This is the reproducibility code for the ACL 2020 paper “Don’t Stop Pretraining.” It fine-tunes RoBERTa classifiers on specialized tasks (biomedical papers, CS citation intent, movie reviews, among others) under three strategies: the off-the-shelf pretrained model; domain-adaptive pretraining (DAPT), which continues pretraining on a broad corpus from the target domain; and task-adaptive pretraining (TAPT), which continues pretraining on the task's own unlabeled text. The repo includes the training scripts, configs, and links to download the authors' resulting models.
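Both DAPT and TAPT boil down to running RoBERTa's masked-language-model objective for longer on new unlabeled text. As a rough illustration of that objective, here is a toy, standard-library-only sketch of the token-masking step. It assumes the usual BERT-style recipe (pick about 15% of positions; of those, 80% become a mask token, 10% a random token, 10% stay unchanged). The function name, the whitespace "tokenizer", and the tiny vocabulary are made up for the example and are not taken from the repo.

```python
import random

MASK = "<mask>"  # RoBERTa's mask token string


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for masked-language-model training.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: left untouched, no prediction target.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the kinase inhibitor reduced tumor growth in treated mice".split()
    corrupted, labels = mask_tokens(sentence, vocab=sentence, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

In real training the masking is applied to subword IDs and is re-sampled on every pass over the data; the only thing that changes between DAPT and TAPT is which unlabeled corpus feeds this step.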

The interesting bit

The paper’s core finding is that TAPT, which is just more pretraining on your unlabeled task data, often rivals or beats DAPT despite using a far smaller corpus, and that combining both can beat generic RoBERTa by meaningful margins. The code lets you test this yourself rather than trust a bar chart.
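One way to keep the strategies straight is to write each one down as an ordered list of training phases. The sketch below does only that bookkeeping; the `Phase` class, the strategy keys, and the `plan` helper are hypothetical names for illustration and do not correspond to anything in the repo. The one ordering detail it relies on is that the combined setting runs DAPT first and TAPT second, as in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    objective: str  # "mlm" for unlabeled text, "classification" for labeled data
    data: str       # which corpus the phase reads


DOMAIN = Phase("mlm", "unlabeled domain corpus")
TASK = Phase("mlm", "unlabeled task text")
FINETUNE = Phase("classification", "labeled task examples")

# Every strategy ends with supervised fine-tuning; they differ only in
# how much extra pretraining happens before it.
STRATEGIES = {
    "baseline": [FINETUNE],
    "dapt": [DOMAIN, FINETUNE],
    "tapt": [TASK, FINETUNE],
    "dapt+tapt": [DOMAIN, TASK, FINETUNE],
}


def plan(strategy):
    """Return the ordered training phases for a named strategy."""
    try:
        return list(STRATEGIES[strategy])
    except KeyError:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {sorted(STRATEGIES)}"
        ) from None


if __name__ == "__main__":
    for name in STRATEGIES:
        print(name, "->", [phase.data for phase in plan(name)])
```

Seen this way, comparing TAPT against DAPT is a matter of swapping one pretraining corpus for another while holding the final fine-tuning phase fixed.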

Key highlights

Pretrained models hosted on Hugging Face: four DAPT variants (CS, biomedical, reviews, news) and ~20 TAPT combinations
Automated dataset downloading from S3 via scripts/train.py
Hyperparameter search integration via allentune
A latest-allennlp branch targets recent versions of transformers; the main branch is pinned to 2020-era pytorch-transformers==1.2.0 for exact reproducibility
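Because the DAPT checkpoints are hosted on Hugging Face, they can in principle be loaded with the transformers library directly, without going through the repo's AllenNLP scripts. The sketch below is one way to do that, not the repo's own loading path. The four model IDs are recalled from the Hugging Face Hub rather than taken from the text above, so check them against the README before relying on them; `checkpoint_for` and `load_for_classification` are hypothetical helper names.

```python
# Hub IDs for the four DAPT checkpoints. ASSUMPTION: recalled from the
# Hugging Face Hub, not stated on this page; verify against the README.
DAPT_CHECKPOINTS = {
    "cs": "allenai/cs_roberta_base",
    "biomed": "allenai/biomed_roberta_base",
    "reviews": "allenai/reviews_roberta_base",
    "news": "allenai/news_roberta_base",
}


def checkpoint_for(domain):
    """Map a domain name to its (assumed) Hub model ID."""
    key = domain.strip().lower()
    if key not in DAPT_CHECKPOINTS:
        raise ValueError(
            f"unknown domain {domain!r}; expected one of {sorted(DAPT_CHECKPOINTS)}"
        )
    return DAPT_CHECKPOINTS[key]


def load_for_classification(domain, num_labels):
    """Load a DAPT checkpoint with a fresh classification head on top."""
    # Imported lazily so checkpoint_for() works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = checkpoint_for(domain)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_classification("biomed", num_labels=2)` downloads the weights on first use and returns a model whose classification head is randomly initialized, so it still needs fine-tuning on labeled task data before its predictions mean anything.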

Caveats

The main branch depends on an old, pinned AllenNLP release and requires downloading models manually; the newer branch is explicitly described as not tested on all models
No performance numbers or comparisons in the README itself—you’ll need the paper for the actual results
Curated TAPT models (larger dataset sizes) only exist for three of the nine tasks

Verdict

Worth a look if you’re building NLP pipelines in specialized domains and suspect your generic BERT is leaving points on the table. Skip if you need a plug-and-play library; this is research scaffolding, not a product.
