A Java NLP Swiss Army knife for Chinese text similarity
Java shops doing Chinese NLP finally get a batteries-included similarity toolkit that doesn't force you into Python.

What it does
similarity is a Java library that scores how alike two Chinese texts are at the word, phrase, sentence, or paragraph level, and adds sentiment analysis and word2vec-powered near-synonym lookup. It packages a dozen-odd algorithms behind a single static Similarity class, so you can call cilinSimilarity("教师", "教授") or morphoSimilarity(sentence1, sentence2) without wiring up your own pipeline.
The interesting bit
The author explicitly treats this as a teaching vehicle: the goal is “spreading NLP similarity methods,” not hiding them behind a black box. Dictionaries ship as plain text, models lazy-load, and the code is deliberately loosely coupled, which is unusual for a one-stop Java NLP library, where opaque jars are the norm.
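For readers unfamiliar with the pattern, the sketch below shows one way a plain-text dictionary can be loaded lazily in Java. It is an illustration only, not the library's code: the class name, the tab-separated format, and the codes are all made up, and an in-memory string stands in for the dictionary file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: parse a plain-text dictionary only when it is first needed.
class LazyDictionarySketch {
    // Stand-in for a dictionary file: one "word<TAB>code" entry per line (made-up format and codes).
    static final String DICT_TEXT = "教师\tA01\n教授\tA02\n";

    // Counts how many times the dictionary text has been parsed.
    static int loadCount = 0;

    // Stays null until the first lookup asks for it.
    private static Map<String, String> dictionary;

    // Synchronized so that concurrent first calls still parse the text only once.
    static synchronized Map<String, String> dictionary() {
        if (dictionary == null) {
            dictionary = load();
        }
        return dictionary;
    }

    private static Map<String, String> load() {
        loadCount++;
        Map<String, String> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(DICT_TEXT))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    map.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }
}
```

Callers that never touch the dictionary never pay the parsing cost, and repeated lookups reuse the same map.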
Key highlights
- Granular matching: word-level (Cilin thesaurus, Hownet semantics, pinyin, edit distance), phrase, sentence (morpho + four edit-distance variants), and paragraph (cosine, SimHash, Jaccard, Jaro–Winkler, etc.)
- Sentiment scoring via Hownet sememe trees at the word level
- Word2vec synonym expansion with a bundled trainer; demo model trained on Demi-Gods and Semi-Devils (wuxia corpus included, apparently)
- Distributed via JitPack, Apache 2.0 licensed
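To make the classical metrics in the list above concrete, here is a minimal Java sketch of two of them, edit distance and Jaccard overlap, computed over individual characters. It is not the library's code: the class and method names are hypothetical, and the library's own variants may tokenize, weight, or normalize differently.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of two character-level similarity metrics.
class ClassicSimilaritySketch {

    // Levenshtein distance over Unicode code points, so each Chinese character counts as one unit.
    static int editDistance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j; // turning an empty prefix into j characters takes j insertions
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }

    // Normalizes the distance into [0, 1], where 1 means identical strings.
    static double editSimilarity(String a, String b) {
        int maxLen = Math.max(a.codePointCount(0, a.length()), b.codePointCount(0, b.length()));
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / maxLen;
    }

    // Jaccard index over the sets of distinct characters: |A ∩ B| / |A ∪ B|.
    static double jaccard(String a, String b) {
        Set<Integer> sa = a.codePoints().boxed().collect(Collectors.toSet());
        Set<Integer> sb = b.codePoints().boxed().collect(Collectors.toSet());
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0;
        }
        Set<Integer> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<Integer> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(editSimilarity("教师", "教授")); // 0.5: one substitution across two characters
        System.out.println(jaccard("教师", "教授"));        // one shared character out of three distinct ones
    }
}
```

Both scores see only surface overlap, which is why thesaurus-based word measures such as Cilin or Hownet are listed separately: two near-synonyms can share few or no characters.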
Caveats
- The author notes the “code is still rough” and asks for PRs with unit tests
- Deep semantic matching (BERT, DSSM, etc.) is listed in the Todo but crossed out with a pointer to the author’s separate text2vec Python project, so this library stays classical/shallow
- Sentiment analysis is word-granularity only; for document-level the author again points to a Python sibling
Verdict
Worth a look if you’re stuck in a Java codebase and need explainable, classical similarity metrics for Chinese text without shipping a Python service. Skip it if you need modern transformer embeddings or production-grade polish; the author basically tells you to go Python for that.