Pre-trained embeddings that actually speak doctor
NCBI's BioWordVec and BioSentVec offer biomedical word and sentence embeddings trained on PubMed and clinical notes, because generic vectors choke on 'myocardial infarction'.

What it does
The repo distributes two pre-trained embedding models: BioWordVec (200-dim word vectors via fastText) and BioSentVec (700-dim sentence vectors via sent2vec). Both were trained on a combined corpus of 28.7M PubMed abstracts and 2.1M MIMIC-III clinical notes, about 4.9 billion tokens in total. You download the binaries and load them; the word model handles out-of-vocabulary terms via fastText’s subword approach, which builds a vector for an unseen word from its character n-grams.
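To make the subword idea concrete, here is a minimal pure-Python sketch of how a fastText-style model can produce a vector for a word it never saw. It is an illustration, not the BioWordVec code: the n-gram range (3 to 6) is fastText's default and may differ from what the authors used, the plain dict stands in for fastText's hashed n-gram table, and the function names are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, fastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    get n-grams distinct from the same letters mid-word.
    min_n/max_n are assumed defaults, not confirmed BioWordVec settings.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that the model knows.

    ngram_vectors is a toy dict {ngram: list of floats}; real fastText
    hashes n-grams into a fixed number of buckets instead.
    Returns a zero vector when no n-gram is known.
    """
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]
```

With this sketch, an unseen drug name or a misspelled diagnosis still gets a vector as long as it shares some n-grams with the training vocabulary, which is the property that makes the word model tolerant of rare biomedical terms.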
The interesting bit
The sentence embeddings use sent2vec, a less common choice that jointly trains word and n-gram embeddings and represents a sentence as their average. The README’s evaluation tables are unusually honest: BioSentVec trained only on MIMIC-III clinical notes actually tanks on the BIOSSES biomedical similarity benchmark (0.350 vs. 0.795 for the combined corpus), which suggests that clinical shorthand and scientific prose are different languages.
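The averaging step is simple enough to sketch in a few lines. The following is a toy illustration of the bag-of-embeddings idea, not sent2vec itself: it assumes unigrams plus bigrams, looks them up in a plain dict rather than a trained, hashed table, and uses hypothetical function names. Cosine similarity is included because it is a common way to compare two sentence vectors.

```python
import math


def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the sentence's known unigrams and bigrams.

    vectors is a toy dict {unit: list of floats}, where a unit is a
    single token or two tokens joined by a space (an assumption made
    for this sketch). Returns a zero vector if nothing is known.
    """
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    units = list(tokens) + bigrams
    known = [vectors[u] for u in units if u in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because the sentence vector is only an average of whatever units the model knows, a model trained on clinical shorthand has little to contribute for sentences written in journal prose, which is one plausible reading of the MIMIC-only result above.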
Key highlights
- Word model: 13GB vectors, 26GB full model; sentence model: 21GB
- Evaluated on actual biomedical benchmarks: MayoSRS, UMNSRS, BIOSSES, MedSTS
- Combined PubMed + MIMIC-III training generally outperforms either corpus alone
- Includes a Jupyter tutorial for loading and using the models
- FAQ and loading instructions live in the project Wiki
Caveats
- Files are enormous; you’ll need the disk space and patience for FTP downloads from NCBI
- The README doesn’t specify hardware requirements or inference speed benchmarks
- Sent2vec is less actively maintained than sentence-transformers; ecosystem support may be thinner
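Given the file sizes, it may be worth checking free disk space before starting a download. The sketch below uses only the standard library; the sizes are the ones quoted above, while the dictionary keys, the function names, and the 10% headroom are assumptions made for illustration.

```python
import shutil

# Approximate sizes quoted in this write-up; the keys are hypothetical labels,
# not the actual file names on the NCBI server.
MODEL_SIZES_GB = {
    "biowordvec_vectors": 13,
    "biowordvec_full_model": 26,
    "biosentvec_model": 21,
}


def has_room(required_gb, free_bytes, headroom=1.1):
    """Return True if free_bytes covers required_gb plus some headroom.

    Sizes are treated as GiB (1024**3 bytes), which slightly
    overestimates the need if the quoted figures are decimal GB.
    """
    return free_bytes >= required_gb * headroom * 1024 ** 3


def free_bytes_at(path="."):
    """Free space, in bytes, on the filesystem that holds path."""
    return shutil.disk_usage(path).free
```

A call such as has_room(MODEL_SIZES_GB["biosentvec_model"], free_bytes_at(".")) tells you whether the sentence model would fit in the current directory's filesystem before you commit to the download.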
Verdict
Grab these if you’re building biomedical NLP and need off-the-shelf vectors without fine-tuning BERT. Skip if you’re memory-constrained or want modern contextual embeddings: this is 2019-era static embedding technology, competently executed and thoroughly evaluated for its time.