A Java NLP toolbox that predates the hype cycle
Before every NLP library shipped a Transformer, someone had to build the boring parts: tries, Viterbi, and SimHash for Chinese text.

What it does
nlp-lang is a foundational Java utility library for Chinese NLP pipelines. It bundles trie and double-array trie structures, text segmentation, HTML stripping, Viterbi algorithm support, and a grab-bag of text-processing utilities: pinyin conversion, simplified/traditional Chinese conversion, Bloom filters, SimHash similarity, and basic word-weight statistics.
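To make the trie-plus-segmentation pairing concrete, here is a minimal sketch of dictionary matching with a plain hash-map trie and greedy forward maximum matching. It is not nlp-lang's API: the class `TrieSketch`, its methods, and the toy dictionary are all hypothetical, and the library's double-array trie is a more compact structure than the one shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names do not come from nlp-lang.
final class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a dictionary word
    }

    private final Node root = new Node();

    // Adds one dictionary word, creating child nodes as needed.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Length of the longest dictionary word starting at text[start], or 0 if none matches.
    int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isWord) {
                best = i - start + 1;
            }
        }
        return best;
    }

    // Greedy forward maximum matching; characters with no match are emitted one at a time.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.max(1, longestMatch(text, pos));
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String word : new String[] {"中国", "中国人", "人民", "银行"}) {
            trie.insert(word);
        }
        // prints [中国人, 民, 银行]
        System.out.println(trie.segment("中国人民银行"));
    }
}
```

The greedy pass grabs 中国人 and strands 民, which is the kind of ambiguity that scoring candidate paths with Viterbi is usually brought in to resolve.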
The interesting bit
The project treats NLP as infrastructure engineering, not model zookeeping. The double-array trie and SimHash implementations suggest it was built for search and deduplication at scale—problems that don’t go away just because LLMs exist.
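For readers who have not met SimHash, a rough sketch of the idea follows. It assumes character bigrams as features and 64-bit FNV-1a as the feature hash; both are choices made for this illustration, and `SimHashSketch` is a hypothetical class, not something taken from nlp-lang.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names do not come from nlp-lang.
final class SimHashSketch {
    // 64-bit FNV-1a over the UTF-8 bytes of a feature; any well-mixed 64-bit hash would do.
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Each feature votes +1 or -1 on every bit position; the sign of the tally sets the bit.
    // Features are overlapping UTF-16 bigrams, so this assumes text in the Basic Multilingual Plane.
    static long fingerprint(String text) {
        int[] votes = new int[64];
        for (int i = 0; i + 1 < text.length(); i++) {
            long h = fnv1a64(text.substring(i, i + 2));
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fp |= 1L << bit;
            }
        }
        return fp;
    }

    // Number of differing bits; a small distance suggests near-duplicate texts.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = fingerprint("今天天气很好，我们去公园散步吧");
        long b = fingerprint("今天天气很好，我们去公园散步");
        long c = fingerprint("双数组字典树适合做大规模词典匹配");
        System.out.println("near-duplicate distance: " + hammingDistance(a, b));
        System.out.println("unrelated distance:      " + hammingDistance(a, c));
    }
}
```

A deduplication pipeline would typically treat two documents as near-duplicates when the distance falls below a small threshold; the right threshold depends on the feature set and the corpus, so it is left out here.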
Key highlights
- Double-array trie and standard trie structures for efficient dictionary matching
- Viterbi algorithm support for sequence labeling
- SimHash fingerprints for near-duplicate detection and deduplication
- Simplified/traditional Chinese conversion and pinyin generation
- In-memory search suggestion and word co-occurrence counting
- Maven artifact: org.nlpcn:nlp-lang:1.7.6
Caveats
- README is sparse: no usage examples, benchmarks, or API documentation beyond the feature list
- The latest release appears to be 1.7.6, and no changelog is visible
- “tire树” in the README is almost certainly a typo for “trie树” (the classic prefix tree)
Verdict
Worth a look if you’re maintaining a legacy Java search or text-processing pipeline, or need battle-tested trie implementations without pulling in a framework. Skip it if you’re building modern neural NLP and expect embeddings, tokenizers, or model serving out of the box.