ddangelov/Top2Vec

Topic modeling that actually reads the room

Top2Vec finds topics automatically, then lets you search them like a semantic index.

3.1k stars · Python · RAG · Search · Language Models
Velocity (7d): +1.4 ★/day · Trend: steady

What it does

Top2Vec trains on a pile of text, figures out how many topics are actually in there, and builds a shared vector space where documents, words, and topics all live together. Once trained, you can query topics by keyword, find similar documents, or rummage through related words without building a separate search pipeline.
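
To make the "train once, then query" workflow concrete, here is a minimal sketch of a helper that asks a trained model for the topics and documents nearest to a keyword. The helper name `summarize_keyword` and the shape of its return value are made up for illustration; `search_topics` and `search_documents_by_keywords` are the library's search methods as we understand them, so check the signatures against the version you install.

```python
def summarize_keyword(model, keyword, num_topics=2, num_docs=3):
    """Return the topics and documents closest to `keyword`.

    `model` is assumed to be a trained top2vec.Top2Vec instance.
    This helper is hypothetical, not part of the library.
    """
    # Topics nearest to the keyword in the shared vector space.
    topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
        keywords=[keyword], num_topics=num_topics
    )
    # Documents nearest to the same keyword.
    documents, doc_scores, doc_ids = model.search_documents_by_keywords(
        keywords=[keyword], num_docs=num_docs
    )
    return {
        "topic_nums": list(topic_nums),
        "top_words": [list(words[:5]) for words in topic_words],
        "doc_ids": list(doc_ids),
    }


# Usage sketch (assumes `pip install top2vec` and a reasonably large corpus):
#
#   from top2vec import Top2Vec
#   model = Top2Vec(documents)          # documents: list of strings
#   print(model.get_num_topics())       # topic count found automatically
#   print(summarize_keyword(model, "medicine"))
```

The keyword has to be in the model's learned vocabulary, and `num_topics` cannot exceed the number of topics the model actually found, so a real script would want to guard both.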

The interesting bit

The newer “Contextual Top2Vec” (beta) goes from document level down to token level: it assigns topics to individual tokens and finds topic spans inside single documents. That means a long document covering, say, both cryptocurrency and sourdough can be segmented by theme rather than dumped into one bucket. The catch: it only works with two specific sentence-transformer models right now.
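
To show what "topic spans" means in practice, here is a toy sketch that collapses per-token topic labels into contiguous spans. It is not how Top2Vec computes anything: the labels below are written by hand, and `topic_spans` is a hypothetical helper. It only illustrates the segmentation step once per-token assignments exist.

```python
from itertools import groupby


def topic_spans(tokens, token_topics):
    """Merge runs of identical topic labels into (topic, start, end, text) spans."""
    if len(tokens) != len(token_topics):
        raise ValueError("tokens and token_topics must be the same length")
    spans = []
    start = 0
    # groupby yields one group per run of consecutive equal labels.
    for topic, run in groupby(token_topics):
        length = len(list(run))
        end = start + length
        spans.append((topic, start, end, " ".join(tokens[start:end])))
        start = end
    return spans


# Hand-labelled toy input: topic 0 = crypto, topic 1 = baking.
tokens = "bitcoin mining rewards halve while my sourdough starter doubles".split()
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

for span in topic_spans(tokens, labels):
    print(span)
# (0, 0, 5, 'bitcoin mining rewards halve while')
# (1, 5, 9, 'my sourdough starter doubles')
```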

Key highlights

  • Automatically detects topic count; no hand-tuning k like LDA
  • No stopword lists or stemming required
  • Joint embedding means topics, documents, and words are directly comparable in the same space
  • Supports Doc2Vec, Universal Sentence Encoder, or BERT sentence transformers
  • New contextual mode adds per-token topic assignments and document topic distributions
  • Built-in semantic search: query documents by topic or keyword
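
The joint-embedding point is easier to see with numbers. The sketch below uses made-up 2-D vectors (real embeddings have hundreds of dimensions) and plain cosine similarity to show why putting topics, documents, and words in one space makes them directly comparable. The names `cosine` and `nearest` are illustrative helpers, not Top2Vec functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, candidates, n=1):
    """Names of the n candidate vectors most similar to the query vector."""
    ranked = sorted(
        candidates.items(), key=lambda item: cosine(query, item[1]), reverse=True
    )
    return [name for name, _ in ranked[:n]]


# Invented toy vectors, all living in the same 2-D space.
topic_vectors = {"crypto": [0.9, 0.1], "baking": [0.1, 0.9]}
word_vectors = {"bitcoin": [0.8, 0.2], "flour": [0.2, 0.8]}
doc_vector = [0.7, 0.3]

print(nearest(doc_vector, topic_vectors))             # ['crypto']
print(nearest(topic_vectors["baking"], word_vectors))  # ['flour']
```

The same `nearest` call works for document-to-topic and topic-to-word lookups, which is the practical payoff of a shared space: no separate index per object type.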

Caveats

  • Contextual Top2Vec is explicitly marked beta; the README warns of “issues or unexpected behavior”
  • Contextual mode is locked to all-MiniLM-L6-v2 or all-mpnet-base-v2; no other embedding models
  • Doc2Vec may outperform pretrained encoders on large or highly specialized vocabularies, but it learns its embeddings from scratch on your corpus, so expect longer training times

Verdict

Worth a look if you need topic modeling without the usual hyperparameter grief, especially if semantic search over those topics is part of the job. Skip it if you need production-grade token-level segmentation today, or if your stack can’t accommodate the contextual mode’s limited model choices.
