← all repositories
explosion/sense2vec

Word2vec with a part-of-speech filter

sense2vec disambiguates "duck" the verb from "duck" the noun by baking context into the vector key itself.

1.7k stars Python ML FrameworksLanguage Models
sense2vec
Velocity · 7d
+0.4
★ / day
Trend
steady
star history

What it does sense2vec trains word vectors where each key is a phrase plus a linguistic sense — like natural_language_processing|NOUN or Facebook|ORG. You can query similar phrases, check frequencies, or plug it into a spaCy pipeline so tokens and spans grow ._.s2v_vec and ._.s2v_most_similar() attributes. It ships with pretrained Reddit comment vectors (2015 and 2019) and can train new ones from raw text using fastText or GloVe under the hood.

The interesting bit The “sense” is just the part-of-speech tag or named-entity label, so the model learns separate embeddings for a word’s different grammatical jobs without any fancy disambiguation network. It’s a 2015 paper by Trask et al. that Explosion dusted off, reimplemented, and wired directly into spaCy v3 as a serializable pipeline component.

Key highlights

  • Standalone class or spaCy nlp.add_pipe("sense2vec") — your call.
  • Extension attributes on Token and Span objects for vector, frequency, and nearest-neighbor lookup.
  • Optional nearest-neighbor caching for faster “most similar” queries.
  • Train your own vectors with a pretrained spaCy model + raw text + fastText/GloVe.
  • Prodigy recipes for bootstrapping similar-phrase lists and NER match patterns.
  • Pretrained Reddit vectors attached to GitHub releases; 2019 set is a 4 GB multi-part download.

Caveats

  • The vector table is case-sensitive and keys must use underscores for spaces (machine_learning|NOUN, not machine learning|NOUN).
  • Span attributes default to the root’s POS tag if no entity label is present, and arbitrary slices may not have keys in the model.
  • spaCy v2 users need the legacy v1.x branch and sense2vec==1.0.3.

Verdict Grab this if you want phrase-level semantic similarity that respects part-of-speech and entity type, especially inside an existing spaCy pipeline. Skip it if you need cross-lingual vectors or modern transformer-scale embeddings — this is classic shallow word2vec with a clever indexing trick.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.