A Swiss Army knife for Chinese semantic search
One Python toolkit that wraps text2vec, CLIP, BM25, Faiss, and half a dozen classic algorithms into a single pip install.

What it does
similarities is a Python toolkit that bundles semantic and literal similarity methods for text and images. It wraps modern embedding models (SentenceBERT, CLIP, Chinese-CLIP) alongside older workhorses (BM25, TF-IDF, SimHash, pHash, SIFT) and plugs them into vector indexes (Faiss, Annoy, Hnswlib) or brute-force search. The pitch: pip install, pick an algorithm, search a corpus.
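To make "brute-force search" concrete, here is a minimal sketch of the idea in plain Python: score the query vector against every corpus vector by cosine similarity and keep the top matches. This is not the similarities API. The function names and the toy 3-dimensional vectors are assumptions for illustration, standing in for embeddings a real model would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, corpus_vecs, topn=3):
    """Score the query against every corpus vector; return (index, score) pairs, best first."""
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy "embeddings"; real models emit hundreds of dimensions.
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
results = brute_force_search([0.8, 0.6, 0.0], corpus_vecs, topn=2)
for idx, score in results:
    print(idx, round(score, 4))
```

Every query touches every vector, so cost grows linearly with corpus size. That is the gap approximate indexes such as Faiss, Annoy, and Hnswlib are meant to close.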
The interesting bit
The project is essentially a curated integration layer for Chinese-language search. It doesn’t invent new models; it glues together text2vec, OpenAI’s CLIP, AutoFaiss, and FastAPI into something you can deploy with CLI commands (bert_embedding, clip_server, etc.). The documentation and demos are in Chinese first, English second, which is a rarity in this space.
Key highlights
- Text search: semantic (CoSENT/SentenceBERT) or literal (BM25, Word2Vec, Cilin, Hownet) with million-to-billion scale retrieval via Faiss
- Image search: CLIP-based text-to-image, image-to-image, plus classic perceptual hashes (pHash, dHash, etc.)
- CLI covers the full pipeline: embedding → index → search → serve (FastAPI backend, Gradio frontend)
- Pre-trained Chinese models included: shibing624/text2vec-base-chinese, Chinese-CLIP variants
- Apache 2.0 license, explicitly commercial-friendly
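For a sense of what "literal" matching means in the text-search bullet, the sketch below scores documents with the standard Okapi BM25 formula using only the standard library. It is an illustration of the technique, not the toolkit's implementation. The function name, the k1 and b defaults, and the whitespace-tokenized English corpus are assumptions; Chinese text would first need a word segmenter.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens.

    Assumes a non-empty corpus of pre-tokenized documents.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF: rarer terms weigh more and the value stays positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized on whitespace.
corpus = [
    "how to open a bank account".split(),
    "bank card lost what to do".split(),
    "weather is nice today".split(),
]
scores = bm25_scores("bank account".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only exact token overlap counts here, so a paraphrase with no shared words scores zero. That is the trade-off against the embedding-based methods listed alongside it.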
Caveats
- The README admits the code is “还很粗糙” (still rather rough); contributors are asked to add unit tests before PRs
- English documentation exists but is secondary; some examples and model cards may require translation
- The “billion-scale” claim rests on the Faiss integration; the README reports no throughput benchmarks of its own
Verdict
Worth a look if you need Chinese-language semantic search without assembling a dozen repos yourself. Skip it if you want a polished, fully documented framework or if you’re only doing English search with mature alternatives already in production.