ddangelov/Top2Vec

Topic modeling that actually reads the room

Top2Vec finds topics automatically, then lets you search them like a semantic index.

★ 3.1k stars · Python · RAG · Search · Language Models

What it does

Top2Vec trains on a pile of text, figures out how many topics are actually in there, and builds a shared vector space where documents, words, and topics all live together. Once trained, you can query topics by keyword, find similar documents, or rummage through related words without building a separate search pipeline.
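
A minimal sketch of that workflow, assuming the top2vec package is installed and that documents is a list of strings large enough for clusters to form. The wrapper function train_and_query and the default keyword "medicine" are illustrative choices, not part of the library. The import sits inside the function so the snippet still loads where the package is missing.

```python
def train_and_query(documents, keyword="medicine"):
    """Train a Top2Vec model and run a few searches against it.

    `train_and_query` is a hypothetical helper for illustration; only the
    Top2Vec class and the methods called on it come from the library.
    """
    from top2vec import Top2Vec  # third-party: pip install top2vec

    # Training also decides how many topics exist; no topic count is passed.
    model = Top2Vec(documents, embedding_model="doc2vec", speed="learn")
    num_topics = model.get_num_topics()

    # Topics closest to the keyword in the shared vector space.
    topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
        keywords=[keyword], num_topics=min(3, num_topics)
    )

    # Documents closest to the same keyword.
    docs, doc_scores, doc_ids = model.search_documents_by_keywords(
        keywords=[keyword], num_docs=5
    )

    # Words closest to the keyword.
    words, sim_scores = model.similar_words(keywords=[keyword], num_words=10)

    return {
        "num_topics": num_topics,
        "topic_nums": topic_nums,
        "doc_ids": doc_ids,
        "similar_words": words,
    }
```

The keyword has to be in the vocabulary the model learned, so on a small or unrelated corpus these searches can raise an error rather than return empty results.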

The interesting bit

The newer “Contextual Top2Vec” (beta) moves from document-level down to token-level topics: it assigns a topic to each token and finds topic spans inside a single document. That means a long document covering, say, both cryptocurrency and sourdough can be segmented by theme rather than dumped into one bucket. The catch: it currently works with only two sentence-transformer models, all-MiniLM-L6-v2 and all-mpnet-base-v2.
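
To make the idea of token-level topics concrete, here is a toy sketch in plain Python. It is not the library's implementation: the 2-D vectors, the topic names, and the helpers assign_topics and topic_spans are all made up for illustration. The sketch simply labels each token with its most similar topic vector, then merges neighbouring tokens that share a label into spans.

```python
import math
from itertools import groupby


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_topics(token_vectors, topic_vectors):
    """Label each token vector with the name of its most similar topic."""
    return [
        max(topic_vectors, key=lambda name: cosine(vec, topic_vectors[name]))
        for vec in token_vectors
    ]


def topic_spans(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) spans."""
    spans = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        spans.append((label, " ".join(tok for tok, _ in group)))
    return spans


# Toy data: hypothetical topic vectors and per-token vectors.
topics = {"crypto": [1.0, 0.0], "baking": [0.0, 1.0]}
tokens = ["bitcoin", "wallet", "fees", "sourdough", "starter"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]]

labels = assign_topics(vectors, topics)
spans = topic_spans(tokens, labels)
print(spans)
# [('crypto', 'bitcoin wallet fees'), ('baking', 'sourdough starter')]
```

In a real model the token vectors would come from the sentence transformer rather than being typed in by hand, and how the library scores or smooths the assignments is not shown here.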

Key highlights

Automatically detects the number of topics; no hand-tuning k as with LDA
No stopword lists or stemming required
Joint embedding means topics, documents, and words are directly comparable in the same space
Supports Doc2Vec, Universal Sentence Encoder, or BERT sentence transformers
New contextual mode adds per-token topic assignments and document topic distributions
Built-in semantic search: query documents by topic or keyword
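
The joint embedding highlight can be illustrated with a small plain-Python sketch. Everything in it is an assumption for the sake of the example: the 2-D vectors are invented, a hard-coded grouping stands in for the clustering step, and the helpers centroid and nearest are hypothetical names. It shows a topic vector taken as the centroid of a group of document vectors, then compared directly against word and document vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest(query, candidates, k=2):
    """Names of the k candidates most similar to the query vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors: documents and words share the same 2-D space.
doc_vectors = {"doc_a": [0.9, 0.2], "doc_b": [0.8, 0.1], "doc_c": [0.1, 0.9]}
word_vectors = {"bitcoin": [1.0, 0.1], "ledger": [0.9, 0.3], "flour": [0.0, 1.0]}

# Assume a clustering step has already grouped doc_a and doc_b together.
topic_vector = centroid([doc_vectors["doc_a"], doc_vectors["doc_b"]])

# Words nearest the topic vector describe the topic.
topic_words = nearest(topic_vector, word_vectors, k=2)

# A word vector can query documents directly, with no separate index.
similar_docs = nearest(word_vectors["flour"], doc_vectors, k=1)

print(topic_words)   # ['bitcoin', 'ledger']
print(similar_docs)  # ['doc_c']
```

Because every item is a vector in the same space, one similarity function serves topic labelling, keyword search, and document lookup alike.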

Caveats

Contextual Top2Vec is explicitly marked beta; the README warns of “issues or unexpected behavior”
Contextual mode is locked to all-MiniLM-L6-v2 or all-mpnet-base-v2; no other embedding models
Doc2Vec may outperform pretrained encoders on large or highly specialized vocabularies, but training takes longer

Verdict

Worth a look if you need topic modeling without the usual hyperparameter grief, especially if semantic search over those topics is part of the job. Skip it if you need production-grade token-level segmentation today, or if your stack can’t accommodate the contextual mode’s limited model choices.

Frequently asked

What is ddangelov/Top2Vec? Top2Vec finds topics automatically, then lets you search them like a semantic index.
Is Top2Vec open source? Yes. ddangelov/Top2Vec is open source, released under the BSD-3-Clause license.
What language is Top2Vec written in? ddangelov/Top2Vec is primarily written in Python.
How popular is Top2Vec? ddangelov/Top2Vec has 3.1k stars on GitHub.
Where can I find Top2Vec? ddangelov/Top2Vec is on GitHub at https://github.com/ddangelov/Top2Vec.