Is contextualized-topic-models open source?

Yes — MilaNLProc/contextualized-topic-models is open source, released under the MIT license.

What language is contextualized-topic-models written in?

MilaNLProc/contextualized-topic-models is primarily written in Python.

How popular is contextualized-topic-models?

MilaNLProc/contextualized-topic-models has 1.3k stars on GitHub.

Where can I find contextualized-topic-models?

MilaNLProc/contextualized-topic-models is on GitHub at https://github.com/MilaNLProc/contextualized-topic-models.

MilaNLProc/contextualized-topic-models

BERT meets bag-of-words: topic modeling that actually reads context

A Python package that bolts contextual embeddings onto classical topic models so your topics stop being a jumble of unrelated words.

★ 1.3k stars · Python · Language Models · Data Tooling

What it does

Contextualized Topic Models (CTM) are a family of neural topic models that take BERT-style sentence embeddings as input. The package offers two main variants. CombinedTM mixes contextual embeddings with a traditional bag-of-words representation to produce more coherent topics. ZeroShotTM takes only the embeddings as input, so it can handle test documents containing words missing from the training vocabulary, and it works cross-lingually when paired with multilingual sentence transformers. There’s also Kitty, a human-in-the-loop classifier for quick document labeling and filtering.
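To make the difference between the two variants concrete, here is a toy sketch of what each one hands to its encoder. This is not the package's code: `encoder_input`, the variant names, and the tiny vectors are invented for illustration, and the real models are neural networks that do more than glue lists together.

```python
from typing import List


def encoder_input(bow: List[float], embedding: List[float], variant: str) -> List[float]:
    """Return the vector a simplified, hypothetical encoder would see for one document."""
    if variant == "combined":
        # CombinedTM-style: bag-of-words counts plus the contextual embedding
        return list(bow) + list(embedding)
    if variant == "zeroshot":
        # ZeroShotTM-style: embedding only, so the input never depends on the vocabulary
        return list(embedding)
    raise ValueError(f"unknown variant: {variant!r}")


bow = [2.0, 0.0, 1.0]               # counts over a toy 3-word vocabulary
embedding = [0.1, -0.4, 0.7, 0.2]   # toy 4-dimensional sentence embedding

combined = encoder_input(bow, embedding, "combined")
zeroshot = encoder_input(bow, embedding, "zeroshot")
print(len(combined), len(zeroshot))
```

Because the zero-shot input ignores the bag-of-words entirely, a test document in another language can still be encoded as long as the embedding model covers that language.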

The interesting bit

The architecture is deliberately embedding-agnostic: swap in whatever new sentence transformer lands on Hugging Face next week without rewriting the model. The authors also spotted a subtle preprocessing trap: BERT wants raw text, but your bag-of-words wants cleaned text, so the toolkit handles feeding different versions of the same document to each branch.
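The raw-versus-cleaned split is easy to get wrong by hand, so here is a minimal standard-library sketch of the idea. It is an illustration only, not the package's preprocessing pipeline: `clean_for_bow`, `make_views`, and the toy stopword list are assumptions made up for this example.

```python
import re
from typing import List, Tuple

# Toy stopword list, assumed for illustration only
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def clean_for_bow(text: str) -> str:
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def make_views(docs: List[str]) -> Tuple[List[str], List[str]]:
    """Return index-aligned (raw, cleaned) lists of documents."""
    raw_kept, cleaned_kept = [], []
    for doc in docs:
        cleaned = clean_for_bow(doc)
        if cleaned:
            # Keep both views or neither, so position i is the same document in each list
            raw_kept.append(doc)
            cleaned_kept.append(cleaned)
    return raw_kept, cleaned_kept


docs = ["The cat sat in the garden.", "Of the and!", "Topic models, in 2021, met BERT."]
raw_docs, bow_docs = make_views(docs)
print(raw_docs)
print(bow_docs)
```

The point to notice is the alignment: a document that cleans down to nothing is dropped from both lists, so the embedding branch and the bag-of-words branch never drift out of step.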

Key highlights

Papers at ACL 2021 and EACL 2021; ~1,300 GitHub stars
Supports any SBERT-compatible embedding model, including multilingual ones
Zero-shot cross-lingual topic modeling via ZeroShotTM
Supervised variant (SuperCTM) and a human-in-the-loop classifier (Kitty)
Multiple Colab tutorials for different use cases
Preprocessing pipeline included to handle the raw-vs-cleaned text split

Caveats

Bag-of-words vocabulary should stay under ~2,000 terms for stable training; larger vocabularies add parameters and make the model harder to fit
Multilingual models used on English-only data may underperform compared to English-specific embeddings
CUDA setup is left to you — the docs just point to PyTorch’s instructions
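The vocabulary cap in the first caveat can be sketched as a simple frequency cut. This is one possible approach under stated assumptions, not necessarily how the package prunes its vocabulary: `cap_vocabulary` and `prune` are hypothetical helpers, and the input is assumed to be already-cleaned, whitespace-tokenized text.

```python
from collections import Counter
from typing import List, Set


def cap_vocabulary(cleaned_docs: List[str], max_terms: int = 2000) -> Set[str]:
    """Keep only the max_terms most frequent tokens across all documents."""
    counts = Counter(tok for doc in cleaned_docs for tok in doc.split())
    return {term for term, _ in counts.most_common(max_terms)}


def prune(cleaned_docs: List[str], vocab: Set[str]) -> List[str]:
    """Drop out-of-vocabulary tokens from each document, preserving order."""
    return [" ".join(t for t in doc.split() if t in vocab) for doc in cleaned_docs]


docs = ["topic model topic", "model embedding", "rare topic"]
vocab = cap_vocabulary(docs, max_terms=2)  # tiny cap so the effect is visible
pruned = prune(docs, vocab)
print(sorted(vocab))
print(pruned)
```

Pruning can leave a document empty; if that happens, drop it from the raw-text list as well so the two inputs stay aligned.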

Verdict

Worth a look if you’re still running LDA and wincing at the incoherent topic dumps. Skip it if your vocabulary is massive and unprunable, or if you need a fully unsupervised black-box solution with no preprocessing decisions to make.
