Is nlp-lang open source?

Yes — NLPchina/nlp-lang is open source, released under the Apache-2.0 license.

What language is nlp-lang written in?

NLPchina/nlp-lang is primarily written in Java.

How popular is nlp-lang?

NLPchina/nlp-lang has 1.5k stars on GitHub.

Where can I find nlp-lang?

NLPchina/nlp-lang is on GitHub at https://github.com/NLPchina/nlp-lang.

NLPchina/nlp-lang

A Java NLP toolbox that predates the hype cycle

Before every NLP library had a Transformer, someone had to build the boring parts—tries, Viterbi, and SimHash for Chinese text.

★ 1.5k stars · Java · Data Tooling

What it does

nlp-lang is a foundational Java utility library for Chinese NLP pipelines. It bundles trie and double-array trie structures, text segmentation, HTML stripping, Viterbi algorithm support, and a grab-bag of text-processing utilities: pinyin conversion, simplified/traditional Chinese conversion, Bloom filters, SimHash similarity, and basic word-weight statistics.
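
To make the dictionary-matching idea concrete, here is a minimal sketch of a plain hash-map trie with greedy longest-match segmentation. It is not nlp-lang's API or implementation (the README documents neither), and it is not a double-array trie; the names `TrieSketch`, `insert`, `longestMatch`, and `segment` are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain hash-map trie, NOT nlp-lang's implementation.
class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a dictionary word
    }

    private final Node root = new Node();

    /** Adds a dictionary word to the trie, one character per level. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** Returns the length of the longest dictionary word starting at offset, or 0 if none. */
    int longestMatch(String text, int offset) {
        Node node = root;
        int best = 0;
        for (int i = offset; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break; // no dictionary word continues with this character
            }
            if (node.isWord) {
                best = i - offset + 1;
            }
        }
        return best;
    }

    /** Forward maximum matching: take the longest dictionary word, else a single character. */
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = Math.max(1, longestMatch(text, pos));
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }
}
```

With the dictionary `ab`, `abc`, `cd`, calling `segment("abcdx")` yields `abc`, `d`, `x`: the greedy match consumes `abc`, so `cd` is never seen. That limitation of pure longest-match segmentation is one reason segmenters often add a probabilistic decoding step on top.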

The interesting bit

The project treats NLP as infrastructure engineering, not model zookeeping. The double-array trie and SimHash implementations suggest it was built for search and deduplication at scale—problems that don’t go away just because LLMs exist.
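
For readers unfamiliar with SimHash, the following is a small sketch of the general technique: hash each token to 64 bits, let every token vote on every bit, and compare fingerprints by Hamming distance. It is an assumption-laden illustration rather than nlp-lang's code; the class `SimHashSketch`, its method names, and the choice of FNV-1a as the token hash are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative only: a generic 64-bit SimHash, NOT nlp-lang's implementation.
class SimHashSketch {
    /** 64-bit FNV-1a over UTF-8 bytes; any well-mixed 64-bit hash would serve here. */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L; // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L; // FNV prime
        }
        return hash;
    }

    /** Builds a fingerprint: bit i is set when more tokens have bit i set than clear. */
    static long fingerprint(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fp |= 1L << bit;
            }
        }
        return fp;
    }

    /** Number of differing bits; small distances suggest near-duplicate inputs. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Because every token contributes equally to each bit's vote, the fingerprint does not depend on token order, and texts sharing most tokens tend to differ in only a few bits. The distance threshold that counts as "near-duplicate" is an application-level choice, and weighted votes (for example by term frequency) are a common variation not shown here.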

Key highlights

Double-array trie and standard trie structures for efficient dictionary matching
Viterbi algorithm support for sequence labeling
SimHash + fingerprint deduplication for near-duplicate detection
Simplified/traditional Chinese conversion and pinyin generation
In-memory search suggestion and word co-occurrence counting
Maven artifact org.nlpcn:nlp-lang:1.7.6
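
As a rough illustration of the Viterbi highlight above, here is a textbook hidden-Markov-model decoder working in log space. It sketches the general algorithm only and says nothing about how nlp-lang exposes it; `ViterbiSketch`, `decode`, and the parameter layout are hypothetical.

```java
// Illustrative only: a textbook HMM Viterbi decoder, NOT nlp-lang's API.
class ViterbiSketch {
    /**
     * Finds the most likely hidden-state sequence for the observations.
     *
     * @param obs   observation indices, one per time step
     * @param start log start probabilities, indexed by state
     * @param trans log transition probabilities, indexed [fromState][toState]
     * @param emit  log emission probabilities, indexed [state][observation]
     * @return the highest-scoring state index for each time step
     */
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log score of any path ending in state s at time t
        int[][] back = new int[n][k];        // predecessor state on that best path

        for (int s = 0; s < k; s++) {
            score[0][s] = start[s] + emit[s][obs[0]];
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double candidate = score[t - 1][p] + trans[p][s];
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = p;
                    }
                }
                score[t][s] = best + emit[s][obs[t]];
                back[t][s] = bestPrev;
            }
        }

        // Pick the best final state, then follow the back-pointers to recover the path.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long sequences. In a segmentation setting the states would typically be position tags and the observations characters, but that mapping is an assumption about usage, not something the README states.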

Caveats

README is sparse: no usage examples, benchmarks, or API documentation beyond the feature list
The latest release appears to be 1.7.6, and no changelog is visible
“tire树” in the README is almost certainly a typo for “trie树” (the classic prefix tree)

Verdict

Worth a look if you’re maintaining a legacy Java search or text-processing pipeline, or need battle-tested trie implementations without pulling in a framework. Skip it if you’re building modern neural NLP and expect embeddings, tokenizers, or model serving out of the box.
