mayabot/mynlp

Java's answer to "just give me Chinese NLP that works"

A modular, Maven-friendly toolkit that ships perceptron-based segmentation, NER, pinyin, and BM25 without dragging in Python's ecosystem.

689 stars · Java · ML Frameworks · Data Tooling
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

Mynlp is a Java-native Chinese NLP toolkit built for production use. It covers the standard bases—word segmentation, part-of-speech tagging, named entity recognition, pinyin conversion, traditional/simplified Chinese conversion, and BM25 scoring—packaged as discrete Maven modules so you pull only what you need.
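
For readers unfamiliar with BM25 scoring, here is a minimal, self-contained sketch of the standard Okapi BM25 formula in plain Java. It does not use mynlp's API and is not taken from its source: the class name Bm25Sketch, the parameters k1 = 1.2 and b = 0.75, the non-negative IDF variant, and the pre-segmented token lists are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of Okapi BM25; not mynlp's implementation.
final class Bm25Sketch {
    private final List<List<String>> docs;                 // each document is a list of tokens
    private final Map<String, Integer> docFreq = new HashMap<>();
    private final double avgLen;                           // average document length in tokens
    private final double k1 = 1.2;                         // term-frequency saturation (assumed default)
    private final double b = 0.75;                         // length normalization (assumed default)

    Bm25Sketch(List<List<String>> docs) {
        this.docs = docs;
        long total = 0;
        for (List<String> doc : docs) {
            total += doc.size();
            // Count each term once per document for document frequency.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        this.avgLen = docs.isEmpty() ? 0.0 : (double) total / docs.size();
    }

    // Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
    double idf(String term) {
        int n = docFreq.getOrDefault(term, 0);
        return Math.log(1.0 + (docs.size() - n + 0.5) / (n + 0.5));
    }

    // BM25 score of the document at docIndex for the given query tokens.
    double score(List<String> query, int docIndex) {
        List<String> doc = docs.get(docIndex);
        double sum = 0.0;
        for (String term : query) {
            int tf = Collections.frequency(doc, term);
            if (tf == 0) {
                continue;                                  // absent terms contribute nothing
            }
            double norm = tf + k1 * (1.0 - b + b * doc.size() / avgLen);
            sum += idf(term) * tf * (k1 + 1.0) / norm;
        }
        return sum;
    }
}
```

The token lists would come from a segmenter in practice, which is why BM25 pairs naturally with word segmentation in a Chinese NLP toolkit.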

The interesting bit

The resource-splitting is unusually sane. Core dictionaries and models (some 60MB+) live in separate artifacts, not bundled into the main JAR. You can opt for the “lazy” mynlp-all convenience package or cherry-pick resources à la carte—useful if you’re counting megabytes or avoiding unused model bloat in containers.

Key highlights

  • Perceptron-based segmentation and tagging (not purely dictionary-driven)
  • fastText and StarSpace integration for word/label representations
  • Custom dictionary support with correction capabilities
  • New word discovery and person-name recognition as built-in modules
  • Openly credits its lineage from HanLP and ansj_seg, borrowing proven algorithms rather than quietly reinventing them
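
To make "perceptron-based, not purely dictionary-driven" concrete, the sketch below tags each character as B (begin), M (middle), E (end), or S (single) with a plain multiclass perceptron and then joins the tags into words. It is a deliberately simplified illustration, not mynlp's implementation: the class name, the three-feature template, and the per-character (non-structured) decoding are all assumptions, and production taggers typically use richer features and sequence decoding.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-character perceptron segmenter; not mynlp's code.
final class PerceptronSegmenterSketch {
    private static final String[] TAGS = {"B", "M", "E", "S"};
    private final Map<String, Double> weights = new HashMap<>();

    // Assumed feature template: current, previous, and next character.
    private static String[] features(String text, int i) {
        String prev = i > 0 ? String.valueOf(text.charAt(i - 1)) : "^";
        String next = i < text.length() - 1 ? String.valueOf(text.charAt(i + 1)) : "$";
        return new String[] {"c=" + text.charAt(i), "p=" + prev, "n=" + next};
    }

    private double score(String[] feats, String tag) {
        double s = 0.0;
        for (String f : feats) {
            s += weights.getOrDefault(f + "|" + tag, 0.0);
        }
        return s;
    }

    // Highest-scoring tag; ties resolve to the earliest tag in TAGS.
    private String predict(String[] feats) {
        String best = TAGS[0];
        double bestScore = score(feats, best);
        for (int t = 1; t < TAGS.length; t++) {
            double s = score(feats, TAGS[t]);
            if (s > bestScore) {
                best = TAGS[t];
                bestScore = s;
            }
        }
        return best;
    }

    // One training pass over a sentence; returns the number of mistakes.
    int trainOnce(String text, String[] gold) {
        int mistakes = 0;
        for (int i = 0; i < text.length(); i++) {
            String[] feats = features(text, i);
            String guess = predict(feats);
            if (!guess.equals(gold[i])) {
                mistakes++;
                // Perceptron update: reward the gold tag, penalize the wrong guess.
                for (String f : feats) {
                    weights.merge(f + "|" + gold[i], 1.0, Double::sum);
                    weights.merge(f + "|" + guess, -1.0, Double::sum);
                }
            }
        }
        return mistakes;
    }

    // Tag every character, then join the B/M/E/S tags into words.
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String tag = predict(features(text, i));
            // A new word starts on B or S, so flush anything still pending.
            if ((tag.equals("B") || tag.equals("S")) && buf.length() > 0) {
                words.add(buf.toString());
                buf.setLength(0);
            }
            buf.append(text.charAt(i));
            if (tag.equals("E") || tag.equals("S")) {
                words.add(buf.toString());
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {
            words.add(buf.toString());
        }
        return words;
    }
}
```

Because the weights are learned from tagged text rather than looked up, a model like this can split words that never appear in a dictionary, which is the practical appeal over purely dictionary-driven matching.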

Caveats

  • Documentation and community presence (QQ group, Chinese-language docs) assume Chinese fluency; English support appears minimal
  • Roughly 690 stars suggest modest adoption outside its target ecosystem; how well it has been battle-tested at scale is unclear from the README alone

Verdict

Worth a look if you’re running JVM-based services and need Chinese text processing without bridging to Python. Skip it if your pipeline is already invested in HanLP’s newer iterations or if you need extensive multilingual support.
