Topic modeling that actually reads the room
Top2Vec finds topics automatically, then lets you search them like a semantic index.

What it does
Top2Vec trains on a pile of text, figures out how many topics are actually in there, and builds a shared vector space where documents, words, and topics all live together. Once trained, you can query topics by keyword, find similar documents, or rummage through related words without building a separate search pipeline.
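To make that workflow concrete, here is a minimal sketch of training a model and running the three kinds of query described above. The helper names `train_model` and `explore` are hypothetical, and the `speed` and `workers` values are illustrative rather than recommended. It assumes the `top2vec` package is installed and that the corpus is reasonably large, since tiny corpora tend to fail during training.

```python
def train_model(documents):
    """Train Top2Vec on a list of strings (assumes `pip install top2vec`)."""
    # Imported lazily so explore() below can be used without the package.
    from top2vec import Top2Vec

    # embedding_model, speed and workers are illustrative choices only.
    return Top2Vec(documents, embedding_model="doc2vec", speed="learn", workers=4)


def explore(model, keyword, num_docs=3):
    """Collect a few query results from a trained model into a plain dict.

    `keyword` must be in the model's vocabulary, otherwise Top2Vec raises.
    """
    # The topic count is discovered during training, not supplied up front.
    num_topics = model.get_num_topics()

    # Topic closest to the keyword in the shared vector space.
    topic_words, _word_scores, _topic_scores, topic_nums = model.search_topics(
        keywords=[keyword], num_topics=1
    )
    best_topic = int(topic_nums[0])

    # Documents closest to that topic.
    docs, _doc_scores, doc_ids = model.search_documents_by_topic(
        topic_num=best_topic, num_docs=num_docs
    )

    # Words closest to the keyword.
    words, _sim_scores = model.similar_words(
        keywords=[keyword], keywords_neg=[], num_words=5
    )

    return {
        "num_topics": num_topics,
        "best_topic": best_topic,
        "topic_words": [str(w) for w in topic_words[0][:5]],
        "documents": list(docs),
        "document_ids": list(doc_ids),
        "similar_words": [str(w) for w in words],
    }
```

With a real corpus, something like `explore(train_model(documents), "bitcoin")` would return the discovered topic count plus the nearest topic, documents, and words for that keyword.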
The interesting bit
The newer “Contextual Top2Vec” (beta) goes from document level down to token level: it assigns topics to individual tokens and finds topic spans inside single documents. That means a long document covering, say, both cryptocurrency and sourdough can be segmented by theme rather than dumped into one bucket. The catch: it only works with two specific sentence-transformer models right now.
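To show what segmenting by theme means once every token has a topic, here is a toy sketch that merges consecutive tokens sharing a label into spans. It is not Top2Vec's implementation or API: the function `topic_spans`, the sample sentence, and the hand-written labels are all made up for illustration, standing in for whatever per-token assignments the contextual mode produces.

```python
from itertools import groupby


def topic_spans(tokens, token_topics):
    """Merge runs of consecutive tokens with the same topic label into spans."""
    if len(tokens) != len(token_topics):
        raise ValueError("tokens and token_topics must have the same length")

    spans = []
    start = 0
    # groupby yields one group per run of identical consecutive labels.
    for topic, run in groupby(token_topics):
        length = len(list(run))
        spans.append((topic, " ".join(tokens[start:start + length])))
        start += length
    return spans


# Hypothetical per-token labels for a two-theme sentence.
tokens = "bitcoin fees spiked today while my starter finally doubled".split()
labels = ["crypto"] * 4 + ["baking"] * 5

for topic, text in topic_spans(tokens, labels):
    print(f"{topic}: {text}")
```

The design point is that spans fall out of the token labels for free, so a single document can contribute to several topics instead of one.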
Key highlights
- Automatically detects topic count; no hand-tuning k as you would with LDA
- No stopword lists or stemming required
- Joint embedding means topics, documents, and words are directly comparable in the same space
- Supports Doc2Vec, Universal Sentence Encoder, or BERT sentence transformers
- New contextual mode adds per-token topic assignments and document topic distributions
- Built-in semantic search: query documents by topic or keyword
Caveats
- Contextual Top2Vec is explicitly marked beta; the README warns of “issues or unexpected behavior”
- Contextual mode is locked to all-MiniLM-L6-v2 or all-mpnet-base-v2; no other embedding models
- Doc2Vec may outperform pretrained encoders on large or highly specialized vocabularies, but you’ll wait longer
Verdict
Worth a look if you need topic modeling without the usual hyperparameter grief, especially if semantic search over those topics is part of the job. Skip it if you need production-grade token-level segmentation today, or if your stack can’t accommodate the contextual mode’s limited model choices.