Is similarity open source?

Yes — shibing624/similarity is open source, released under the Apache-2.0 license.

What language is similarity written in?

shibing624/similarity is primarily written in Java.

How popular is similarity?

shibing624/similarity has 1.6k stars on GitHub.

Where can I find similarity?

shibing624/similarity is on GitHub at https://github.com/shibing624/similarity.


shibing624/similarity

A Java NLP Swiss Army knife for Chinese text similarity

Java shops doing Chinese NLP finally get a batteries-included similarity toolkit that doesn't force you into Python.

★ 1.6k stars | Java | RAG · Search | ML Frameworks


What it does

similarity is a Java library that scores how alike Chinese texts are at the word, phrase, sentence, or paragraph level, and also provides sentiment analysis and word2vec-powered near-synonym lookup. It packages a dozen-odd algorithms behind a single static Similarity class, so you can call cilinSimilarity("教师", "教授") or morphoSimilarity(sentence1, sentence2) without wiring up your own pipeline.

The interesting bit

The author explicitly treats this as a teaching vehicle: the goal is “spreading NLP similarity methods,” not hiding them behind a black box. Dictionaries ship as plain text, models lazy-load, and the code is deliberately loosely coupled, which is unusual for a one-stop Java NLP library, where opaque jars are the norm.
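To make the "plain-text dictionaries, lazily loaded" idea concrete, here is a minimal sketch of one way such a design could look in Java. It is not taken from the library: the class names, the tab-separated "word, score" format, and the sample entries are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative sketch only; none of these names come from shibing624/similarity. */
final class LazyDictionarySketch {

    /** Defers an expensive load until the first call to get(), then caches the result. */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        @Override
        public synchronized T get() {
            if (!loaded) {
                value = loader.get();
                loaded = true;
            }
            return value;
        }

        synchronized boolean isLoaded() {
            return loaded;
        }
    }

    /** Parses lines of the (hypothetical) form "word<TAB>score"; '#' starts a comment line. */
    static Map<String, Double> parse(String plainText) {
        Map<String, Double> scores = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(plainText))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    scores.put(parts[0], Double.parseDouble(parts[1]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return scores;
    }

    /** Stand-in for a dictionary file; a real project would read this from a resource. */
    static final String SAMPLE = "# word\tscore\n好\t1.0\n坏\t-1.0\n";

    public static void main(String[] args) {
        Lazy<Map<String, Double>> sentiment = new Lazy<>(() -> parse(SAMPLE));
        System.out.println("loaded before first use? " + sentiment.isLoaded());
        System.out.println("score for 好: " + sentiment.get().get("好"));
        System.out.println("loaded after first use? " + sentiment.isLoaded());
    }
}
```

Because the dictionary is ordinary text, a reader can open it, see exactly what drives a score, and edit it without rebuilding anything. The lazy wrapper means a caller who only needs one algorithm does not pay the start-up cost of the others.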

Key highlights

Granular matching: word-level (Cilin thesaurus, Hownet semantics, pinyin, edit distance), phrase, sentence (morpho + four edit-distance variants), and paragraph (cosine, SimHash, Jaccard, Jaro–Winkler, etc.)
Sentiment scoring via Hownet sememe trees at the word level
Word2vec synonym expansion with a bundled trainer; demo model trained on Demi-Gods and Semi-Devils (wuxia corpus included, apparently)
Distributed via JitPack, Apache 2.0 licensed
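
Two of the classical metrics named above, edit distance and Jaccard, are simple enough to sketch in a few lines. The code below is a generic textbook version written for this page, not the library's implementation. The class and method names are hypothetical, and the choice of character bigrams as the Jaccard unit is an assumption (a real pipeline would more likely use segmented words).

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not code from shibing624/similarity. */
final class ClassicSimilaritySketch {

    /** Levenshtein distance over Unicode code points, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j; // turning the empty prefix into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // turning s[0..i) into the empty prefix takes i deletions
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }

    /** Normalises edit distance to [0, 1]; 1.0 means the strings are identical. */
    static double editSimilarity(String a, String b) {
        int maxLen = Math.max(a.codePointCount(0, a.length()), b.codePointCount(0, b.length()));
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / maxLen;
    }

    /** Set of adjacent character pairs; an assumed tokenisation for this sketch. */
    static Set<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            grams.add(new String(cps, i, 2));
        }
        return grams;
    }

    /** Jaccard index: size of the intersection divided by size of the union. */
    static double jaccard(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) {
            return a.equals(b) ? 1.0 : 0.0; // too short to form any bigram
        }
        Set<String> intersection = new HashSet<>(ga);
        intersection.retainAll(gb);
        int unionSize = ga.size() + gb.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        System.out.println(editSimilarity("教师", "教授"));
        System.out.println(jaccard("我喜欢自然语言处理", "我喜欢语言处理"));
    }
}
```

Both scores are explainable by inspection: "教师" and "教授" differ in one of two characters, so the normalised edit similarity is 0.5. Being this easy to check by hand is the main appeal of the classical metrics over learned embeddings.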

Caveats

The author notes the “code is still rough” and asks for PRs with unit tests
Deep semantic matching (BERT, DSSM, etc.) is listed in the Todo but crossed out, with a pointer to the author’s separate text2vec Python project, so this library stays classical/shallow
Sentiment analysis is word-granularity only; for document-level sentiment, the author again points to a sibling Python project

Verdict

Worth a look if you’re stuck in a Java codebase and need explainable, classical similarity metrics for Chinese text without shipping a Python service. Skip it if you need modern transformer embeddings or production-grade polish; the author basically tells you to go to Python for that.
