BERT meets bag-of-words: topic modeling that actually reads context
A Python package that bolts contextual embeddings onto classical topic models so your topics stop being a jumble of unrelated words.

What it does
Contextualized Topic Models (CTM) are a family of neural topic models that feed BERT-style embeddings into topic modeling. The package offers two main variants: CombinedTM, which mixes contextual embeddings with a traditional bag-of-words representation to produce more coherent topics, and ZeroShotTM, which relies purely on embeddings and can handle missing vocabulary at test time or work cross-lingually when paired with multilingual sentence transformers. There’s also Kitty, a human-in-the-loop classifier for quick document labeling and filtering.
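To make the difference between the two variants concrete, here is a minimal, dependency-free sketch of what each one takes as input. It is not the package's code: `bow_vector`, `combined_input`, `zeroshot_input`, and the hard-coded `fake_embedding` are hypothetical stand-ins, and the real models learn neural layers on top of these inputs.

```python
from typing import Dict, List


def bow_vector(tokens: List[str], vocab: Dict[str, int]) -> List[float]:
    """Count in-vocabulary tokens; out-of-vocabulary tokens are dropped."""
    vec = [0.0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1.0
    return vec


def combined_input(bow: List[float], embedding: List[float]) -> List[float]:
    """CombinedTM-style input (assumed here as plain concatenation)."""
    return list(bow) + list(embedding)


def zeroshot_input(embedding: List[float]) -> List[float]:
    """ZeroShotTM-style input: the contextual embedding only."""
    return list(embedding)


vocab = {"topic": 0, "model": 1, "bert": 2}

# Stand-in for a sentence-transformer output; the values are made up.
fake_embedding = [0.12, -0.40, 0.33, 0.08]

seen = bow_vector("bert topic model topic".split(), vocab)
unseen = bow_vector("modelli tematici neurali".split(), vocab)

print(seen)    # counts for in-vocabulary words
print(unseen)  # all zeros: nothing in this document is in the vocabulary
print(len(combined_input(seen, fake_embedding)))
print(len(zeroshot_input(fake_embedding)))
```

The second document has an empty bag-of-words vector because none of its words are in the training vocabulary, yet it would still get an embedding. That is why a model whose input is the embedding alone can handle unseen vocabulary or another language at test time.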
The interesting bit
The architecture is deliberately embedding-agnostic: you can swap in whatever new sentence transformer Hugging Face releases next week without rewriting the model. The authors also spotted a subtle preprocessing trap: BERT wants raw text, but your bag-of-words wants cleaned text, so the toolkit handles feeding different versions of the same document to each branch.
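The raw-vs-cleaned split can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the toolkit's own pipeline: `clean`, `pair_documents`, and the tiny `STOPWORDS` set are hypothetical. The point it shows is that both versions of each document must stay aligned, so a document that ends up empty after cleaning is dropped from both lists.

```python
import string
from typing import List, Tuple

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def clean(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords and bare numbers."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS and not t.isdigit()]
    return " ".join(tokens)


def pair_documents(raw_docs: List[str]) -> Tuple[List[str], List[str]]:
    """Return aligned (raw, cleaned) lists, skipping docs that clean to nothing."""
    raw_kept: List[str] = []
    cleaned_kept: List[str] = []
    for doc in raw_docs:
        cleaned = clean(doc)
        if cleaned:
            raw_kept.append(doc)        # untouched text for the embedding model
            cleaned_kept.append(cleaned)  # cleaned text for the bag-of-words
    return raw_kept, cleaned_kept


docs = [
    "The Topic of the talk is BERT!",
    "Of and the.",
    "Neural topic models, 2021.",
]
raw_for_embeddings, cleaned_for_bow = pair_documents(docs)

for raw, cleaned in zip(raw_for_embeddings, cleaned_for_bow):
    print(repr(raw), "->", repr(cleaned))
```

The second document consists only of stopwords, so it disappears from both lists. Index i in one list still refers to the same document as index i in the other.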
Key highlights
- Papers published at ACL 2021 and EACL 2021; ~1,300 stars
- Supports any SBERT-compatible embedding model, including multilingual ones
- Zero-shot cross-lingual topic modeling via ZeroShotTM
- Supervised variant (SuperCTM) and a human-in-the-loop classifier (Kitty)
- Multiple Colab tutorials for different use cases
- Preprocessing pipeline included to handle the raw-vs-cleaned text split
Caveats
- Bag-of-words vocabulary should stay under ~2,000 terms for stable training; a larger vocabulary adds parameters and makes the model harder to fit
- Multilingual models used on English-only data may underperform compared to English-specific embeddings
- CUDA setup is left to you; the docs just point to PyTorch’s instructions
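One way to respect the vocabulary caveat is to keep only the most frequent terms before building the bag-of-words. The sketch below assumes already-cleaned, whitespace-tokenized documents; `cap_vocabulary` is a hypothetical helper written for illustration, not a function from the package.

```python
from collections import Counter
from typing import List


def cap_vocabulary(cleaned_docs: List[str], max_terms: int = 2000) -> List[str]:
    """Keep only the max_terms most frequent tokens across the corpus."""
    counts = Counter(tok for doc in cleaned_docs for tok in doc.split())
    keep = {term for term, _ in counts.most_common(max_terms)}
    # Rebuild each document from the tokens that survived the cut.
    return [" ".join(t for t in doc.split() if t in keep) for doc in cleaned_docs]


corpus = ["topic model topic", "bert topic model", "rare word"]
pruned = cap_vocabulary(corpus, max_terms=2)  # tiny cap so the effect is visible
print(pruned)
```

With a cap of two, only "topic" and "model" survive, and the last document becomes an empty string. In practice such documents would need to be removed, together with their raw-text counterparts, before training.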
Verdict
Worth a look if you’re still running LDA and wincing at the incoherent topic dumps. Skip it if your vocabulary is massive and unprunable, or if you need a fully unsupervised black-box solution with no preprocessing decisions to make.