shibing624/similarity

A Java NLP swiss-army knife for Chinese text similarity

Java shops doing Chinese NLP finally get a batteries-included similarity toolkit that doesn't force you into Python.

1.6k stars · Java · RAG · Search · ML Frameworks
Velocity (7d): +0.5 ★/day · Trend: steady

What it does

similarity is a Java library that scores how alike Chinese texts are at the word, phrase, sentence, or paragraph level, and adds sentiment analysis and word2vec-powered near-synonym lookup. It packages a dozen-odd algorithms behind a single static Similarity class, so you can call cilinSimilarity("教师", "教授") or morphoSimilarity(sentence1, sentence2) without wiring up your own pipeline.

The interesting bit

The author explicitly treats this as a teaching vehicle: the goal is “spreading NLP similarity methods,” not hiding them behind a black box. Dictionaries ship as plain text, models lazy-load, and the code is deliberately loosely coupled, which is unusual for a one-stop Java NLP library, where opaque jars are the norm.

Key highlights

  • Granular matching: word-level (Cilin thesaurus, Hownet semantics, pinyin, edit distance), phrase, sentence (morpho + four edit-distance variants), and paragraph (cosine, SimHash, Jaccard, Jaro–Winkler, etc.)
  • Sentiment scoring via Hownet sememe trees at the word level
  • Word2vec synonym expansion with a bundled trainer; demo model trained on Demi-Gods and Semi-Devils (wuxia corpus included, apparently)
  • Distributed via JitPack, Apache 2.0 licensed
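To give a feel for what "classical" means in the list above, here is a minimal, self-contained sketch of two of the named metrics, Jaccard and edit distance, applied character by character to short Chinese strings. It is not the library's code: the class name ClassicalSimilaritySketch and its method names are hypothetical, and the library's own implementations may tokenize or normalize differently.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: hypothetical class, not part of shibing624/similarity.
public class ClassicalSimilaritySketch {

    /** Collects the distinct Unicode code points of a string. */
    public static Set<Integer> codePointSet(String s) {
        return s.codePoints().boxed().collect(Collectors.toSet());
    }

    /** Jaccard similarity: |A ∩ B| / |A ∪ B| over character sets. */
    public static double jaccard(String a, String b) {
        Set<Integer> setA = codePointSet(a);
        Set<Integer> setB = codePointSet(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<Integer> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Integer> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein distance over code points, using two rolling rows. */
    public static int editDistance(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // cost of building y[0..j) from an empty prefix
        }
        for (int i = 1; i <= x.length; i++) {
            curr[0] = i; // cost of deleting the first i characters of x
            for (int j = 1; j <= y.length; j++) {
                int substitution = prev[j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[y.length];
    }

    /** Maps edit distance into [0, 1] by dividing by the longer length. */
    public static double editSimilarity(String a, String b) {
        int maxLen = Math.max(a.codePointCount(0, a.length()),
                              b.codePointCount(0, b.length()));
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(jaccard("教师", "教授"));
        System.out.println(editSimilarity("教师", "教授"));
    }
}
```

For the pair 教师 / 教授, the two strings share one of three distinct characters, so the Jaccard score is 1/3, and one substitution over a length of two gives an edit similarity of 0.5. Surface metrics like these know nothing about meaning, which is why the library pairs them with thesaurus-based scores such as Cilin and Hownet.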

Caveats

  • The author notes the “code is still rough” and asks for PRs with unit tests
  • Deep semantic matching (BERT, DSSM, etc.) is listed in the Todo but crossed out with a pointer to the author’s separate text2vec Python project—so this library stays classical/shallow
  • Sentiment analysis is word-granularity only; for document-level the author again points to a Python sibling

Verdict

Worth a look if you’re stuck in a Java codebase and need explainable, classical similarity metrics for Chinese text without shipping a Python service. Skip it if you need modern transformer embeddings or production-grade polish; the author basically tells you to go Python for that.
