ChunelFeng/caiss

ANN search with training wheels: auto-tuned params and SQL syntax

A C++ similarity-search engine that tries to spare you from tuning HNSW/Faiss knobs by hand, wrapped in multi-language SDKs and a SQL layer.

Star velocity (7d): +0.2 ★/day · trend: steady

What it does

Caiss is a cross-platform, multi-language approximate-nearest-neighbor (ANN) retrieval engine for vectors, words, and sentences. It exposes a C-style core API with bindings for Python, Java, and C#, plus a RESTful interface, Docker images, and a web demo. The README emphasizes “CRUD”-style operations (train, search, insert, ignore, save) rather than read-only search, and adds a SQL-like syntax for basic data manipulation.

The interesting bit

The project pitches itself as a response to two common complaints: “won’t tune parameters” and “distance functions are hard.” Its CAISS_Train API accepts a target precision together with fastRank and realRank; the engine iterates for up to maxEpoch rounds, adjusting its own parameters until the fast approximate results (fastRank) land within the real top-k (realRank) at the requested precision. Whether this auto-tuning is a thin wrapper or genuinely novel is unclear from the README, which does not show the internals, but the intent is to lower the barrier for developers who know they need ANN but don’t want to become HNSW parameter experts.
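
To make the idea concrete, here is a minimal Python sketch of what such a precision-targeted training loop could look like. It is not Caiss code: the README does not show the internals, so `ToyIndex`, its `effort` knob, the doubling `step`, and the exact hit criterion (all fast results must lie inside the brute-force top `real_rank`) are assumptions made purely for illustration. Only the parameter names precision, fastRank, realRank, and maxEpoch come from the README.

```python
import bisect
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(data, query, k):
    """Brute-force ground truth: indices of the k closest vectors."""
    return sorted(range(len(data)), key=lambda i: sq_dist(data[i], query))[:k]


class ToyIndex:
    """Hypothetical stand-in for an ANN index (NOT how Caiss or HNSW works).

    Vectors are ordered by their first coordinate; a search only ranks the
    `effort` vectors whose first coordinate is closest to the query's.
    """

    def __init__(self, data):
        self.data = data
        self.order = sorted(range(len(data)), key=lambda i: data[i][0])
        self.keys = [data[i][0] for i in self.order]

    def search(self, query, k, effort):
        pos = bisect.bisect_left(self.keys, query[0])
        lo = max(0, pos - effort // 2)
        hi = min(len(self.order), lo + effort)
        lo = max(0, hi - effort)  # keep the window full near the right edge
        candidates = self.order[lo:hi]
        return sorted(candidates, key=lambda i: sq_dist(self.data[i], query))[:k]


def auto_train(data, queries, precision, fast_rank, real_rank, max_epoch, step=2):
    """Raise `effort` until the measured precision meets the target.

    A query counts as a hit when all of its top `fast_rank` approximate
    results fall inside the exact top `real_rank` (assumed criterion).
    Returns (effort, achieved_precision, epochs_used).
    """
    index = ToyIndex(data)
    truth = [set(exact_top_k(data, q, real_rank)) for q in queries]
    effort = min(len(data), max(fast_rank, 1))
    achieved, epoch = 0.0, 0
    for epoch in range(1, max_epoch + 1):
        hits = sum(
            set(index.search(q, fast_rank, effort)) <= t
            for q, t in zip(queries, truth)
        )
        achieved = hits / len(queries)
        if achieved >= precision or effort >= len(data):
            break
        if epoch < max_epoch:  # only grow if another evaluation will follow
            effort = min(len(data), effort * step)
    return effort, achieved, epoch


if __name__ == "__main__":
    random.seed(42)
    data = [[random.random() for _ in range(4)] for _ in range(500)]
    queries = [[random.random() for _ in range(4)] for _ in range(30)]
    print(auto_train(data, queries, precision=0.9,
                     fast_rank=5, real_rank=10, max_epoch=10))
```

The trade-off is the usual one: a larger `effort` buys precision at the cost of query time, and the loop stops at the first setting that meets the target. A real engine would presumably tune graph or index parameters rather than a scan window, but the control flow (measure against brute force, adjust, repeat until the target or the epoch cap) would be similar.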

Key highlights

  • Auto-training loop with precision target and epoch limits (no manual HNSW efConstruction tweaking required, in theory)
  • Multiple distance metrics plus support for custom distance functions via callback
  • Multi-threading and caching support; batch query capable
  • SDKs in C++, Python, Java, C#; RESTful and SQL interfaces; Docker image available
  • Web demo at chunel.cn:3001 for English word similarity (pipe-separated multi-word queries)
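
Two of the highlights above, custom distance callbacks and pipe-separated multi-word queries, are easy to picture with a small sketch. The Python below is illustrative only and does not use the Caiss SDK: `top_k`, `batch_word_query`, and the tiny word-vector table are hypothetical names and data, and the search is plain brute force rather than ANN.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]
DistanceFn = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(vectors: Dict[str, Vector], query: Vector, k: int,
          distance: DistanceFn = euclidean,
          skip: Optional[str] = None) -> List[str]:
    """Brute-force k nearest words under a caller-supplied distance callback."""
    words = [w for w in vectors if w != skip]
    return sorted(words, key=lambda w: distance(vectors[w], query))[:k]


def batch_word_query(vectors: Dict[str, Vector], text: str, k: int,
                     distance: DistanceFn = euclidean) -> Dict[str, List[str]]:
    """Split a pipe-separated query such as "cat|car" and search each word.

    Words missing from the table are silently skipped (an assumption).
    """
    results = {}
    for word in (w.strip() for w in text.split("|")):
        if word in vectors:
            results[word] = top_k(vectors, vectors[word], k, distance, skip=word)
    return results


if __name__ == "__main__":
    toy = {
        "cat": [1.0, 0.0], "kitten": [0.9, 0.1], "dog": [0.8, 0.3],
        "car": [0.0, 1.0], "truck": [0.1, 0.9],
    }
    # Any callable with the DistanceFn shape works, e.g. Manhattan distance.
    manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    print(batch_word_query(toy, "cat|car", 2, distance=manhattan))
```

Passing the metric as a plain callable keeps the search code independent of the distance definition, which is presumably the appeal of the callback approach in the real engine as well.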

Caveats

  • The README is heavy on feature lists and light on benchmarks, algorithmic details, or comparisons against raw Faiss/HNSW/MRPT performance
  • The text-embedding side of the training pipeline currently depends on TensorFlow/Keras/BERT; you bring your own model and format its output as JSON vector files yourself
  • “Based on Google, Facebook, Alibaba existing results” is stated but not substantiated with citations or code provenance

Verdict

Worth a look if you need ANN search in a polyglot stack and would rather trade some transparency for convenience. Hardcore vector-database users will still want to see benchmarks and architecture docs before betting production workloads on it.
