datumbox/datumbox-framework

Java ML toolkit that predates the Python monoculture

A broad, batteries-included machine learning framework for Java developers who'd rather not bridge to Python.

1.1k stars Java ML Frameworks
Velocity (7d): +0.3 ★/day · Trend: steady

What it does Datumbox is a Java library that bundles classical ML and statistical methods—Naive Bayes, SVM, PCA, regression variants, clustering, ANOVA, and more—into a single Maven dependency. It also ships pre-trained models for sentiment analysis, spam detection, language detection, and several other text classification tasks via a separate “Zoo” repository.

The interesting bit The breadth is the story: this tries to be sklearn for Java circa 2013, covering everything from descriptive statistics on censored data to Dirichlet process mixture models. The author also maintains a commercial API at datumbox.com, so the framework has seen production use—though not all classes are equally battle-tested.

Key highlights

  • One dependency gets you algorithms, stats, and pre-trained NLP models
  • Pre-trained classifiers for sentiment, spam, language, gender, topic, and adult-content detection
  • Javadoc + JUnit tests serve as the primary documentation
  • Apache 2.0 licensed; maintained through at least 2020
  • Semantic versioning with stable releases tagged on master

Caveats

  • Explicitly alpha: public APIs may shift, and not all classes are equally tested
  • No CLI or Python bindings—contributors are explicitly asked to help with this
  • Documentation beyond Javadoc and blog posts is thin; you’ll be reading test cases

Verdict Worth a look if you’re building Java-native data pipelines and need breadth without JNI wrangling. Skip it if you’re already committed to a modern Python stack or need deep neural networks—this is classical ML, not transformers and GPUs.
