ncbi-nlp/bluebert

BERT went to med school and actually paid attention

A BERT variant pre-trained on 4 billion words of PubMed abstracts and clinical notes, because general-domain language models struggle with medical jargon.

593 stars · Python · Language Models · Domain Apps
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

BlueBERT is a BERT checkpoint with continued pre-training on biomedical text: roughly 4 billion words drawn from PubMed abstracts, plus MIMIC-III clinical notes. The NCBI team offers base and large variants, with or without the clinical data mix, plus fine-tuning scripts for five NLP tasks: sentence similarity, NER, relation extraction, document classification, and natural language inference.

The interesting bit

The real value isn’t the architecture; it’s the data curation and the explicit comparison against ELMo and general-domain BERT. The authors preprocessed PubMed with surgical modesty (lowercasing, ASCII filtering, NLTK tokenization), then published both the corpus and the exact pretraining commands. You can reproduce from scratch or grab the HuggingFace weights.
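
The three preprocessing steps named above are simple enough to sketch. This is a minimal illustration and not the repository's own script: the function name `preprocess` and the pluggable `tokenize` parameter are hypothetical, and the default whitespace split is only a stand-in for the NLTK tokenizer the authors used.

```python
import re
from typing import Callable, List

# Runs of line breaks and runs of non-ASCII characters are each
# replaced by a single space before tokenization.
_NEWLINES = re.compile(r"[\r\n]+")
_NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def preprocess(text: str, tokenize: Callable[[str], List[str]] = str.split) -> str:
    """Lowercase, filter to ASCII, tokenize, and re-join with single spaces.

    ASSUMPTION: `tokenize` defaults to whitespace splitting so the sketch
    has no third-party dependencies; it is not the tokenizer the authors used.
    """
    text = text.lower()
    text = _NEWLINES.sub(" ", text)
    text = _NON_ASCII.sub(" ", text)
    return " ".join(tokenize(text))


if __name__ == "__main__":
    # "\u00b5" is the micro sign; it is dropped by the ASCII filter.
    print(preprocess("Patients received 5 \u00b5g\nof Drug-X."))
```

To get closer to the described pipeline, pass an NLTK tokenizer as `tokenize`, for example `nltk.tokenize.TreebankWordTokenizer().tokenize`. Note what the ASCII filter costs: Greek letters and unit symbols common in biomedical text simply disappear.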

Key highlights

  • Four model variants: Base/Large × PubMed-only/PubMed+MIMIC-III
  • Weights hosted on both NCBI FTP and HuggingFace Hub
  • Fine-tuning scripts included for 5 biomedical NLP tasks (STS, NER, RE, classification, NLI)
  • Preprocessed ~4B-word PubMed corpus available for download
  • Evaluated on the 10 datasets of the BLUE benchmark against ELMo and general-domain BERT
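
For the HuggingFace route, the 2 × 2 variant grid can be enumerated and loaded with the `transformers` auto classes. This is a sketch under assumptions: the Hub IDs below follow the naming pattern the checkpoints appear to use under the `bionlp` organization, so confirm each ID on the Hub before relying on it. The helpers `hub_id` and `load` are hypothetical names for illustration.

```python
from itertools import product

# ASSUMPTION: checkpoint names follow this pattern on the Hub; verify before use.
# The size suffixes are the standard BERT shapes (layers, hidden size, heads).
SIZES = {"base": "L-12_H-768_A-12", "large": "L-24_H-1024_A-16"}
CORPORA = {"pubmed": "pubmed", "pubmed+mimic": "pubmed_mimic"}


def hub_id(size: str, corpus: str) -> str:
    """Build the assumed Hub ID for one size/corpus combination."""
    return f"bionlp/bluebert_{CORPORA[corpus]}_uncased_{SIZES[size]}"


# All four variants: Base/Large x PubMed-only/PubMed+MIMIC-III.
VARIANTS = [hub_id(size, corpus) for size, corpus in product(SIZES, CORPORA)]


def load(size: str = "base", corpus: str = "pubmed"):
    """Download and return (tokenizer, model) for one variant."""
    # Imported lazily so the ID helpers work without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    name = hub_id(size, corpus)
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)


if __name__ == "__main__":
    for name in VARIANTS:
        print(name)
```

Calling `load("base", "pubmed+mimic")` would fetch the clinical base model, provided the ID resolves. The checkpoints are uncased, which matches the lowercasing applied during pretraining.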

Caveats

  • Code appears to be thin wrappers around Google’s original BERT scripts; not a standalone framework
  • Last meaningful update was 2020 (HuggingFace migration); repository looks dormant
  • Clinical use requires the usual MIMIC-III credentialing dance

Verdict

Worth a look if you’re doing biomedical NLP and need a proven starting point. Skip it if you want a modern, maintained library: this is a model zoo with glue scripts, not a framework.
