shibing624/similarities

A Swiss Army knife for Chinese semantic search

One Python toolkit that wraps text2vec, CLIP, BM25, Faiss, and half a dozen classic algorithms into a single pip install.

901 stars · Python · RAG · Search · Data Tooling
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

similarities is a Python toolkit that bundles semantic and literal similarity methods for text and images. It wraps modern embedding models (SentenceBERT, CLIP, Chinese-CLIP) alongside older workhorses (BM25, TF-IDF, SimHash, pHash, SIFT) and plugs them into vector indexes (Faiss, Annoy, Hnswlib) or brute-force search. The pitch: pip install, pick an algorithm, search a corpus.
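To make the "brute-force search" option concrete, here is a minimal sketch of what exhaustive semantic search amounts to: score the query vector against every corpus vector by cosine similarity and keep the top hits. This is not the library's API. The function names (`cosine`, `brute_force_search`) and the tiny 3-dimensional vectors are hypothetical stand-ins for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def brute_force_search(query_vec, corpus_vecs, topn=3):
    """Score the query against every corpus vector; return top (index, score) pairs."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy "embeddings", made up for illustration. A real embedding model
# would produce vectors with hundreds of dimensions.
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
hits = brute_force_search([1.0, 0.0, 0.1], corpus_vecs, topn=2)
for index, score in hits:
    print(index, round(score, 3))
```

Every query touches every vector, so cost grows linearly with corpus size. That is the usual reason to switch to an approximate index such as Faiss, Annoy, or Hnswlib once the corpus gets large.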

The interesting bit

The project is essentially a curated integration layer for Chinese-language search. It doesn’t invent new models; it glues together text2vec, OpenAI’s CLIP, AutoFaiss, and FastAPI into something you can deploy with CLI commands (bert_embedding, clip_server, etc.). The documentation and demos are in Chinese first and English second, which is a rarity in this space.

Key highlights

  • Text search: semantic (CoSENT/SentenceBERT) or literal (BM25, Word2Vec, Cilin, Hownet), with million- to billion-scale retrieval via Faiss
  • Image search: CLIP-based text-to-image, image-to-image, plus classic perceptual hashes (pHash, dHash, etc.)
  • CLI covers the full pipeline: embedding → index → search → serve (FastAPI backend, Gradio frontend)
  • Pre-trained Chinese models included: shibing624/text2vec-base-chinese, Chinese-CLIP variants
  • Apache 2.0 license, explicitly commercial-friendly
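For readers unfamiliar with the literal side of the list above, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It illustrates the general technique only and is not taken from the project's code. The function name `bm25_scores`, the default parameters `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document (assumes a non-empty corpus)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalised via b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Pre-tokenized toy corpus. Chinese text would first need a word segmenter
# (not shown); whitespace-split English keeps the sketch self-contained.
corpus = [
    "how to open a bank account".split(),
    "bank card lost what to do".split(),
    "weather is nice today".split(),
]
scores = bm25_scores("bank account".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Unlike the embedding-based methods, this needs no model download or GPU, which is one reason literal scorers remain useful as a baseline or as a first-stage filter.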

Caveats

  • The README admits the code is “还很粗糙” (still rather rough); contributors are asked to add unit tests before PRs
  • English documentation exists but is secondary; some examples and model cards may require translation
  • “Billion-scale” claims are supported by Faiss integration, but actual throughput numbers aren’t benchmarked in the README

Verdict

Worth a look if you need Chinese-language semantic search without assembling a dozen repos yourself. Skip it if you want a polished, fully-documented framework or if you’re only doing English search with mature alternatives already in production.
