MilaNLProc/contextualized-topic-models

BERT meets bag-of-words: topic modeling that actually reads context

A Python package that bolts contextual embeddings onto classical topic models so your topics stop being a jumble of unrelated words.

1.3k stars · Python · Language Models · Data Tooling
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

Contextualized Topic Models (CTM) are a family of neural topic models that feed BERT-style sentence embeddings into the topic model. The package offers two main variants: CombinedTM, which mixes contextual embeddings with a traditional bag-of-words representation to produce more coherent topics, and ZeroShotTM, which relies purely on embeddings, so it can handle words it never saw during training and can work cross-lingually when paired with multilingual sentence transformers. There’s also Kitty, a human-in-the-loop classifier for quick document labeling and filtering.
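
To make the difference between the two variants concrete, here is a toy sketch of what each one takes as input. It is not the package's code: `build_encoder_input`, the tiny vector sizes, and the plain concatenation are assumptions made for illustration only.

```python
from typing import List, Optional


def build_encoder_input(embedding: List[float],
                        bow: Optional[List[int]] = None) -> List[float]:
    """Hypothetical helper: assemble the input vector for one document.

    With a bag-of-words vector, mimic a CombinedTM-style input (counts plus
    embedding). Without one, mimic a ZeroShotTM-style input (embedding only).
    """
    if bow is None:
        return list(embedding)
    return [float(count) for count in bow] + list(embedding)


# Made-up numbers: a 3-dimensional "embedding" and a 4-term vocabulary.
embedding = [0.12, -0.40, 0.88]
bow = [2, 0, 1, 0]

combined_input = build_encoder_input(embedding, bow)
zeroshot_input = build_encoder_input(embedding)

print(len(combined_input), len(zeroshot_input))
```

Because the zero-shot style input never touches the vocabulary, a test document full of unseen words (or written in another language) still yields a valid input, provided the embedding model can encode it.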

The interesting bit

The architecture is deliberately embedding-agnostic — swap in whatever new sentence transformer HuggingFace releases next week without rewriting the model. The authors also spotted a subtle preprocessing trap: BERT wants raw text, but your bag-of-words wants cleaned text, so the toolkit handles feeding different versions of the same document to each branch.
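
The raw-versus-cleaned split can be sketched in a few lines of standard-library Python. This is an illustration of the idea rather than the toolkit's own preprocessing: `clean_for_bow`, `split_views`, and the tiny stopword list are hypothetical names and values chosen for the example.

```python
import re
from typing import List, Tuple

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def clean_for_bow(text: str) -> str:
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def split_views(docs: List[str]) -> Tuple[List[str], List[str]]:
    """Return aligned (raw, cleaned) lists for the two branches.

    Documents that become empty after cleaning are dropped from both lists,
    so index i always refers to the same document in each view.
    """
    raw, cleaned = [], []
    for doc in docs:
        bow_text = clean_for_bow(doc)
        if bow_text:
            raw.append(doc)
            cleaned.append(bow_text)
    return raw, cleaned


docs = [
    "The cat sat in the garden.",
    "Of the and!",
    "Topic models read a lot of text.",
]
raw_docs, bow_docs = split_views(docs)
print(raw_docs)
print(bow_docs)
```

The raw list would go to the embedding model and the cleaned list to the bag-of-words branch. Keeping the two lists the same length and in the same order is the part that is easy to get wrong by hand.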

Key highlights

  • Published at ACL and EACL 2021; ~1,300 stars
  • Supports any SBERT-compatible embedding model, including multilingual ones
  • Zero-shot cross-lingual topic modeling via ZeroShotTM
  • Supervised variant (SuperCTM) and a human-in-the-loop classifier (Kitty)
  • Multiple Colab tutorials for different use cases
  • Preprocessing pipeline included to handle the raw-vs-cleaned text split

Caveats

  • Bag-of-words vocabulary should stay under ~2,000 terms for stable training; a larger vocabulary means more parameters and a model that is harder to fit
  • Multilingual models used on English-only data may underperform compared to English-specific embeddings
  • CUDA setup is left to you — the docs just point to PyTorch’s instructions
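
One way to respect the vocabulary cap is to keep only the most frequent terms before building the bag-of-words. The sketch below assumes already-cleaned, whitespace-tokenized documents. `top_k_vocabulary` and `restrict_to_vocabulary` are hypothetical helpers, not functions from the package, and the default of 2,000 simply mirrors the caveat above.

```python
from collections import Counter
from typing import Iterable, List


def top_k_vocabulary(cleaned_docs: Iterable[str], k: int = 2000) -> List[str]:
    """Return the k most frequent whitespace-separated terms."""
    counts: Counter = Counter()
    for doc in cleaned_docs:
        counts.update(doc.split())
    return [term for term, _ in counts.most_common(k)]


def restrict_to_vocabulary(doc: str, vocab: Iterable[str]) -> str:
    """Drop every token that is not in the chosen vocabulary."""
    keep = set(vocab)
    return " ".join(t for t in doc.split() if t in keep)


cleaned_docs = ["cat sat garden cat", "cat garden bird"]
vocab = top_k_vocabulary(cleaned_docs, k=2)  # k=2 only to keep the demo small
pruned = [restrict_to_vocabulary(d, vocab) for d in cleaned_docs]
print(vocab)
print(pruned)
```

Frequency-based pruning is only one option. Filtering by document frequency or dropping very common terms may suit some corpora better.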

Verdict

Worth a look if you’re still running LDA and wincing at the incoherent topic dumps. Skip it if your vocabulary is massive and unprunable, or if you need a fully unsupervised black-box solution with no preprocessing decisions to make.
