Is BioSentVec open source?

Yes. ncbi-nlp/BioSentVec is an open-source project tracked on heatdrop.

What language is BioSentVec written in?

ncbi-nlp/BioSentVec is primarily written in Jupyter Notebook.

How popular is BioSentVec?

ncbi-nlp/BioSentVec has 615 stars on GitHub.

Where can I find BioSentVec?

ncbi-nlp/BioSentVec is on GitHub at https://github.com/ncbi-nlp/BioSentVec.

ncbi-nlp/BioSentVec

Pre-trained embeddings that actually speak doctor

NCBI's BioWordVec and BioSentVec offer biomedical word and sentence embeddings trained on PubMed and clinical notes, because generic vectors choke on 'myocardial infarction'.

★ 615 stars · Jupyter Notebook · Language Models · Data Tooling · Domain Apps

What it does

The repo distributes two pre-trained embedding models: BioWordVec (200-dimensional word vectors trained with fastText) and BioSentVec (700-dimensional sentence vectors trained with sent2vec). Both were trained on a combined corpus of 28.7M PubMed abstracts and 2.1M MIMIC-III clinical notes, about 4.9 billion tokens in total. You download the binaries and load them; the word model handles out-of-vocabulary terms through fastText’s subword approach, which builds a vector for an unseen word from its character n-grams.
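To make the subword idea concrete, here is a minimal pure-Python sketch of how a vector for an unseen word can be assembled from character n-grams. It is a toy under stated assumptions, not the fastText implementation: the helper names `char_ngrams` and `oov_vector` are hypothetical, the n-gram range of 3 to 6 mirrors fastText's documented defaults rather than anything the README states for BioWordVec, and real fastText hashes n-grams into buckets instead of looking them up in a dictionary.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that appear in `ngram_vectors`.

    `ngram_vectors` is a toy dict standing in for a trained subword table.
    Returns a zero vector when no n-gram is known.
    """
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy 2-dimensional subword table; the values are made up for illustration.
toy_table = {"<in": [1.0, 0.0], "rct": [0.0, 1.0]}

print(char_ngrams("where", 3, 3))           # ['<wh', 'whe', 'her', 'ere', 're>']
print(oov_vector("infarct", toy_table, 2))  # [0.5, 0.5]
```

The point of the design is that a rare term like a misspelled drug name still shares n-grams with words seen in training, so it gets a usable vector instead of a lookup failure.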

The interesting bit

The sentence embeddings use sent2vec, a less common choice that jointly trains word and n-gram embeddings and represents a sentence as the average of those embeddings. The README’s evaluation tables are unusually honest: BioSentVec trained only on MIMIC-III clinical notes actually tanks on the BIOSSES biomedical similarity benchmark (0.350 vs. 0.795 for the combined corpus), showing that clinical shorthand and scientific prose are different languages.
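The averaging step is simple enough to sketch in a few lines. The following is a toy illustration, not sent2vec itself: the function names, the bigram-only feature set, and the 2-dimensional vectors are all assumptions made for the example, and the real library learns its table during training and hashes n-grams rather than storing them by name.

```python
import math


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of every unigram and bigram found in `embeddings`."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    found = [embeddings[f] for f in tokens + bigrams if f in embeddings]
    if not found:
        return [0.0] * dim
    return [sum(vec[k] for vec in found) / len(found) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy embedding table; the values are made up for illustration.
TOY = {
    "myocardial": [1.0, 0.0],
    "infarction": [1.0, 0.0],
    "myocardial infarction": [1.0, 1.0],
    "heart": [1.0, 0.5],
    "attack": [0.5, 0.5],
}

a = sentence_vector(["myocardial", "infarction"], TOY, dim=2)
b = sentence_vector(["heart", "attack"], TOY, dim=2)
print(cosine(a, b))
```

Because the bigram "myocardial infarction" has its own entry, it pulls the sentence vector away from the plain average of the two words, which is how n-gram features let the model encode multi-word terms.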

Key highlights

Word model: 13GB vectors, 26GB full model; sentence model: 21GB
Evaluated on actual biomedical benchmarks: MayoSRS, UMNSRS, BIOSSES, MedSTS
Combined PubMed + MIMIC-III training generally outperforms either corpus alone
Includes a Jupyter tutorial for loading and using the models
FAQ and loading instructions live in the project Wiki
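Benchmarks such as BIOSSES and MedSTS pair sentences with human similarity ratings, and a model is scored by how closely its own similarities track those ratings. A minimal sketch of that scoring step follows, assuming Pearson correlation as the metric (an assumption here, not something this page states) and using made-up numbers in place of real benchmark data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (std_x * std_y)


# Hypothetical data: human ratings and model cosine similarities for four sentence pairs.
human_ratings = [4.0, 1.0, 3.0, 0.5]
model_scores = [0.91, 0.22, 0.70, 0.35]

print(round(pearson(human_ratings, model_scores), 3))
```

A score near 1.0 means the model ranks and spaces the pairs much as the annotators did; a score like 0.350 means it only loosely agrees with them.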

Caveats

Files are enormous; you’ll need the disk space and patience for FTP downloads from NCBI
The README doesn’t specify hardware requirements or inference speed benchmarks
Sent2vec is less actively maintained than sentence-transformers; ecosystem support may be thinner
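Given the file sizes, it may be worth checking free disk space before starting a download. A small sketch using only the standard library follows; the sizes come from the highlights above, while the function names and the 10% headroom factor are arbitrary choices for illustration.

```python
import shutil

# Approximate download sizes in GB, as listed in the highlights.
MODEL_SIZES_GB = {
    "BioWordVec vectors": 13,
    "BioWordVec full model": 26,
    "BioSentVec": 21,
}


def fits_on_disk(size_gb, free_bytes, headroom=1.1):
    """True if `free_bytes` covers the file plus a safety margin (hypothetical 10%)."""
    return free_bytes >= size_gb * headroom * 1024 ** 3


def report(path="."):
    """Map each model name to whether it would fit on the volume holding `path`."""
    free = shutil.disk_usage(path).free
    return {name: fits_on_disk(size, free) for name, size in MODEL_SIZES_GB.items()}


for name, ok in report().items():
    print(f"{name}: {'fits' if ok else 'not enough free space'}")
```

Disk space is only half the story: loading a model also needs roughly that much RAM, which the README does not quantify.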

Verdict

Grab these if you’re building biomedical NLP and need off-the-shelf vectors without fine-tuning BERT. Skip if you’re memory-constrained or want modern contextual embeddings: this is 2019-era static embedding technology, competently executed and thoroughly evaluated.
