mimno/Mallet

Java NLP toolkit that predates the deep-learning hype cycle

Before transformers took over, MALLET was the workhorse for topic modeling, classification, and sequence tagging in Java.

1k stars · Java · ML Frameworks · Data Tooling
Velocity · 7d: +0.2 ★ / day · Trend: steady

What it does

MALLET is a Java toolkit for classical statistical NLP: document classification, topic modeling, sequence tagging, and clustering. It provides command-line tools and a Maven-packaged library for turning raw text into feature vectors, training models, and evaluating results.

The interesting bit

The “pipes” system is the quiet backbone. Each pipe handles one preprocessing step (tokenization, stopword removal, vectorization), and pipes chain together like a Unix pipeline, with each step's output feeding the next. Because the whole chain is defined once and reused, every document goes through exactly the same transformations, which is what you want from a reproducible text-to-numbers workflow.
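To make the chaining idea concrete, here is a minimal sketch in plain Java. It does not use MALLET's API: the class `PipeSketch`, its methods, and the four-word stopword set are all hypothetical, chosen only to show three steps composed into one reusable pipeline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: these names are hypothetical and are not MALLET classes.
class PipeSketch {

    // Tiny made-up stopword set, just for the example.
    static final Set<String> STOPWORDS = Set.of("the", "a", "and", "of");

    // Step 1: lowercase the text and split it on runs of non-letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Step 2: drop every token found in the stopword set.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) kept.add(t);
        }
        return kept;
    }

    // Step 3: count occurrences, giving a sparse bag-of-words vector.
    static Map<String, Integer> vectorize(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // The chain: each step's output is the next step's input.
    static final Function<String, Map<String, Integer>> PIPELINE =
        ((Function<String, List<String>>) PipeSketch::tokenize)
            .andThen(PipeSketch::removeStopwords)
            .andThen(PipeSketch::vectorize);

    public static void main(String[] args) {
        // Prints {cat=2, hat=1}
        System.out.println(PIPELINE.apply("The cat and the hat of the cat"));
    }
}
```

The point of the design is that `PIPELINE` is a single value: apply it to training documents and to new documents alike, and both get identical preprocessing. Swapping a step means replacing one link in the chain and leaving the rest alone.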

Key highlights

  • Topic modeling is the headline: LDA (including parallel, hierarchical, and labeled variants), Pachinko Allocation, plus word2vec-style skip-gram embeddings
  • Classification spans Naïve Bayes, MaxEnt/Logistic Regression, Decision Trees, AdaBoost, Bagging, and Winnow
  • Sequence tagging via CRFs, MEMMs, and HMMs, built on an extensible finite-state transducer framework
  • Includes L-BFGS and other numerical optimizers, since many algorithms need them
  • GRMM add-on supports inference in general graphical models and CRFs with arbitrary structure

Caveats

  • Requires Java 17+ and Maven; macOS users may need to manually fix their OpenJDK path
  • The README lists algorithms but doesn’t clarify which are actively maintained versus legacy
  • The README makes no mention of GPU acceleration or neural architectures; the toolkit is intentionally old-school

Verdict

Useful if you’re maintaining legacy Java NLP pipelines, teaching classical methods, or need topic models without dragging in Python’s entire ML stack. Skip it if you want BERT, LLMs, or anything that requires a CUDA driver.
