bab2min/tomotopy

A topic-modeling library that actually uses your CPU

Python wrapper around a C++ Gibbs sampler with SIMD vectorization, for when you need LDA, HDP, or a dozen other topic models without waiting for gensim to finish.

596 stars C++ Language Models

What it does

tomotopy is a Python extension of tomoto, a C++ topic-modeling library built on Collapsed Gibbs Sampling (CGS). It wraps 14 models, including LDA, Hierarchical Dirichlet Process, Correlated Topic Model, and Dynamic Topic Model, into a pip-installable package. The API is straightforward: add documents, call train(), inspect topics.
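The add, train, inspect loop can be sketched roughly as follows. This is a minimal illustration, not code from the project: the helper `top_words_per_topic` and the `TOY_DOCS` corpus are invented for this example, and the parameter values are arbitrary. It assumes `tomotopy` is installed from pip.

```python
def top_words_per_topic(docs, k=2, iterations=100, top_n=3, seed=42):
    """Train a small LDA model and return the top_n words of each topic."""
    # Imported inside the function so the sketch still loads on machines
    # where the third-party package (pip install tomotopy) is missing.
    import tomotopy as tp

    mdl = tp.LDAModel(k=k, seed=seed)
    for words in docs:
        mdl.add_doc(words)                 # a document is a list of tokens
    mdl.train(iterations, workers=1)       # one worker keeps the toy run simple
    return [
        [word for word, _prob in mdl.get_topic_words(topic, top_n=top_n)]
        for topic in range(k)
    ]


# Hypothetical toy corpus, made up for this example.
TOY_DOCS = [
    "cat dog pet fur".split(),
    "dog pet leash walk".split(),
    "cat fur purr pet".split(),
    "stock market trade price".split(),
    "price market bank trade".split(),
    "bank stock price loan".split(),
]
```

Calling `top_words_per_topic(TOY_DOCS)` returns one word list per topic. With a corpus this small the topics are not meaningful; the point is only the shape of the API.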

The interesting bit

The speedup comes from SIMD instruction sets (AVX512, AVX2, SSE2), auto-detected at import time. The README shows tomotopy running 200 iterations in less time than gensim takes for 10, with comparable log-likelihood. It's the rare Python ML library where the C++ underneath isn't just glue: it's doing the actual arithmetic in vectorized registers.

Key highlights

  • 14 topic models in one package, from basic LDA to Pachinko Allocation and supervised variants
  • SIMD acceleration auto-selected at runtime; tp.isa reports which instruction set is in use
  • Model save/load with type safety (loading an HDP file into an LDA class raises an exception)
  • Built-in web viewer since v0.13.0 for inspecting trained models in a browser
  • Corpus utilities with transform hooks for mapping metadata between model types
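Two of the highlights above, the file type check and `tp.isa`, can be tried with a few lines. This is a hedged sketch: the helper names are invented for this example, and because the exact exception type raised on a mismatched load is not stated here, the code catches `Exception` broadly.

```python
import os
import tempfile


def loading_wrong_type_fails():
    """Save an HDP model, then check that LDAModel.load rejects the file."""
    import tomotopy as tp  # third-party: pip install tomotopy

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample_hdp_model.bin")
        tp.HDPModel().save(path)
        tp.HDPModel.load(path)        # matching class: expected to load
        try:
            tp.LDAModel.load(path)    # mismatched class
        except Exception:             # exact exception type not assumed
            return True
        return False


def selected_isa():
    """Return the instruction-set string that tomotopy reports."""
    import tomotopy as tp  # third-party: pip install tomotopy

    return tp.isa
```

`loading_wrong_type_fails()` should return `True` if the type check behaves as described, and `selected_isa()` returns a short string naming the instruction set.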

Caveats

  • On non-x86 platforms it must be compiled from source, which needs a compiler with C++14 support
  • The interactive viewer video in the README is hosted on a private GitHub user-images URL with an expired JWT, so it may not load for most readers
  • CGS generally needs more iterations to converge than Variational Bayes; the speed claim is about time per iteration, not total time to convergence
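To see why a single Gibbs iteration is cheap, here is a textbook collapsed Gibbs sampler for LDA in pure Python. It is a teaching sketch under the standard LDA update rule, not tomoto's implementation: the function name `lda_gibbs`, the hyperparameter defaults, and the sample corpus are all assumptions made for this example. Each iteration is one pass over the tokens that decrements three counters, draws a topic, and increments them again.

```python
import random


def lda_gibbs(docs, k, iterations, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (vocab, n_kw, n_dk, n_k)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    n_dk = [[0] * k for _ in docs]          # topic counts per document
    n_kw = [[0] * n_words for _ in range(k)]  # word counts per topic
    n_k = [0] * k                           # token totals per topic
    z = []                                  # topic assignment of each token

    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(k)
            z_d.append(t)
            n_dk[d][t] += 1
            n_kw[t][word_id[w]] += 1
            n_k[t] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                t = z[d][i]
                # Take the token out of the counts.
                n_dk[d][t] -= 1
                n_kw[t][v] -= 1
                n_k[t] -= 1
                # p(z = j | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta).
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][v] + eta)
                    / (n_k[j] + n_words * eta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                # Put the token back under its newly sampled topic.
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][v] += 1
                n_k[t] += 1

    return vocab, n_kw, n_dk, n_k


# Hypothetical sample corpus, made up for this example.
SAMPLE_DOCS = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "walk"],
    ["stock", "market", "price"],
    ["price", "bank", "stock"],
]
vocab, n_kw, n_dk, n_k = lda_gibbs(SAMPLE_DOCS, k=2, iterations=50)
```

The per-token work is a handful of multiply-adds over the k topics, so iterations are fast, but many of them may be needed before the assignments settle, which is the trade-off the caveat describes.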

Verdict

Worth a look if you’re doing topic modeling at scale on x86-64 hardware and want one library that covers most major models. Skip it if you need GPU acceleration, non-x86 deployment, or variational methods specifically.
