NLPchina/nlp-lang

A Java NLP toolbox that predates the hype cycle

Before every NLP library had a Transformer, someone had to build the boring parts—tries, Viterbi, and SimHash for Chinese text.

1.5k stars · Java · Data Tooling · 7-day velocity: +0.3 ★/day · Trend: steady

What it does

nlp-lang is a foundational Java utility library for Chinese NLP pipelines. It bundles trie and double-array trie structures, text segmentation, HTML stripping, Viterbi algorithm support, and a grab-bag of text-processing utilities: pinyin conversion, simplified/traditional Chinese conversion, Bloom filters, SimHash similarity, and basic word-weight statistics.
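
To make the trie-plus-segmentation pairing concrete, here is a minimal sketch of dictionary matching with a plain hash-map trie and greedy forward maximum matching. It is not nlp-lang's API: the class `TrieSketch` and its methods are hypothetical names, the trie is the simple variant rather than a double-array trie, and ASCII words stand in for Chinese dictionary entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a plain hash-map trie, NOT nlp-lang's own classes.
class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a dictionary word
    }

    private final Node root = new Node();

    // Adds one dictionary word, creating nodes along its path as needed.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Length of the longest dictionary word starting at `start`, or 0 if there is none.
    int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break; // no dictionary word continues with this character
            }
            if (node.isWord) {
                best = i - start + 1;
            }
        }
        return best;
    }

    // Forward maximum matching: take the longest word at each position,
    // falling back to a single character when nothing in the dictionary matches.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.max(1, longestMatch(text, pos));
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String word : new String[] {"nlp", "lang", "language"}) {
            trie.insert(word);
        }
        System.out.println(trie.segment("nlplanguage")); // prints [nlp, language]
        System.out.println(trie.segment("nlp-lang"));    // prints [nlp, -, lang]
    }
}
```

Greedy longest-match is the simplest dictionary segmenter and can pick the wrong split when words overlap. A double-array trie serves the same lookups from flat integer arrays instead of per-node hash maps.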

The interesting bit

The project treats NLP as infrastructure engineering, not model zookeeping. The double-array trie and SimHash implementations suggest it was built for search and deduplication at scale—problems that don’t go away just because LLMs exist.
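
For readers who have not met SimHash, the deduplication idea can be sketched in a few lines: hash each feature of a document, let every feature vote on each of 64 bits, and compare the resulting fingerprints by Hamming distance. The code below is an assumption-laden illustration, not nlp-lang's implementation. The class `SimHashSketch`, the choice of character bigrams as features, and the FNV-1a feature hash are all picked here for brevity.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: a 64-bit SimHash over character bigrams, NOT nlp-lang's implementation.
class SimHashSketch {

    // FNV-1a 64-bit hash of the string's UTF-8 bytes (an arbitrary choice of feature hash).
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;  // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;       // FNV prime
        }
        return hash;
    }

    // Each bigram votes +1 or -1 on every bit position; bits with a positive total are set.
    // Texts shorter than two characters have no features and map to 0.
    static long simHash(String text) {
        int[] votes = new int[64];
        for (int i = 0; i + 2 <= text.length(); i++) {
            long h = fnv1a64(text.substring(i, i + 2));
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long original = simHash("the quick brown fox jumps over the lazy dog");
        long edited = simHash("the quick brown fox jumped over the lazy dog");
        long unrelated = simHash("double-array tries store transitions in two int arrays");
        System.out.println("edited copy:    " + hammingDistance(original, edited));
        System.out.println("unrelated text: " + hammingDistance(original, unrelated));
    }
}
```

In a deduplication pipeline, two documents are typically treated as near-duplicates when the distance falls under a small threshold. The right threshold depends on the features and the corpus, so it is left out here.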

Key highlights

  • Double-array trie and standard trie structures for efficient dictionary matching
  • Viterbi algorithm support for sequence labeling
  • SimHash + fingerprint deduplication for near-duplicate detection
  • Simplified/traditional Chinese conversion and pinyin generation
  • In-memory search suggestion and word co-occurrence counting
  • Maven artifact org.nlpcn:nlp-lang:1.7.6
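
As a rough picture of what Viterbi decoding for sequence labeling involves, the sketch below finds the most probable state sequence of a small discrete hidden Markov model. It is a generic textbook version, not nlp-lang's API: the class `ViterbiSketch`, the `decode` signature, and the toy probabilities in `main` are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only: textbook Viterbi decoding for a discrete HMM, NOT nlp-lang's API.
class ViterbiSketch {

    // Returns the most probable state sequence for the observations `obs`.
    // start[s], trans[prev][s] and emit[s][o] are plain probabilities;
    // scores are accumulated in log space to avoid underflow on long inputs.
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log-probability of any path ending in state s at time t
        int[][] back = new int[n][k];        // predecessor state on that best path

        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double candidate = score[t - 1][p] + Math.log(trans[p][s]);
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = p;
                    }
                }
                score[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = bestPrev;
            }
        }

        // Pick the best final state, then follow the back-pointers to recover the path.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model with two hidden states (0, 1) and three observation symbols (0, 1, 2).
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        int[] path = decode(new int[] {0, 1, 2}, start, trans, emit);
        System.out.println(Arrays.toString(path)); // prints [0, 0, 1]
    }
}
```

In a Chinese segmentation setting the hidden states would usually be per-character position tags and the observations the characters themselves, with probabilities estimated from a labeled corpus.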

Caveats

  • README is sparse: no usage examples, benchmarks, or API documentation beyond the feature list
  • Last meaningful release appears to be 1.7.6 with no changelog visible
  • “tire树” (“tire tree”) in the README is almost certainly a typo for “trie树” (“trie tree”, the classic prefix tree)

Verdict

Worth a look if you’re maintaining a legacy Java search or text-processing pipeline, or need battle-tested trie implementations without pulling in a framework. Skip it if you’re building modern neural NLP and expect embeddings, tokenizers, or model serving out of the box.
