Google's TurboQuant, but make it searchable
A Rust vector index that squeezes 31 GB of float32 embeddings into 4 GB without a training phase, then outruns FAISS on most query benchmarks.

What it does
turbovec is a local vector index with Python bindings and drop-in adapters for LangChain, LlamaIndex, Haystack, and Agno. You feed it vectors; it quantizes them to 2-bit or 4-bit on the fly, stores them compressed, and answers k-NN queries using hand-written SIMD kernels. No training step, no parameter tuning, no rebuilds as data grows.
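To see where a figure like "31 GB into 4 GB" can come from, here is a back-of-the-envelope sketch. The corpus size (5 million vectors) and dimension (1536) are hypothetical, chosen only so the raw size lands near the headline number; the 4-byte per-vector scalar is also an assumption, standing in for the one stored scalar mentioned under length-renormalized scoring. None of these helpers are part of turbovec's API.

```python
def float32_bytes(n_vectors: int, dim: int) -> int:
    """Size of an uncompressed float32 matrix: 4 bytes per coordinate."""
    return n_vectors * dim * 4


def quantized_bytes(n_vectors: int, dim: int, bits: int, scalar_bytes: int = 4) -> int:
    """Size of packed `bits`-bit codes plus one scalar per vector.

    scalar_bytes=4 is an assumption (one float32 per vector).
    """
    code_bytes = (dim * bits + 7) // 8  # round up to whole bytes per vector
    return n_vectors * (code_bytes + scalar_bytes)


# Hypothetical corpus, not taken from the README.
n_vectors, dim = 5_000_000, 1536

raw = float32_bytes(n_vectors, dim)
four_bit = quantized_bytes(n_vectors, dim, bits=4)
two_bit = quantized_bytes(n_vectors, dim, bits=2)

print(f"float32: {raw / 1e9:.2f} GB")
print(f"4-bit:   {four_bit / 1e9:.2f} GB ({raw / four_bit:.1f}x smaller)")
print(f"2-bit:   {two_bit / 1e9:.2f} GB ({raw / two_bit:.1f}x smaller)")
```

The ratio is close to 32/bits regardless of corpus size; the per-vector scalar only shaves a fraction of a percent off it at this dimension.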
The interesting bit
The trick is data-oblivious quantization. After a random rotation, every coordinate is expected to follow a known Beta distribution, regardless of your actual data. That predictability lets turbovec precompute optimal Lloyd-Max buckets from math alone, skipping the usual k-means codebook training entirely. A per-coordinate calibration step (TQ+) corrects drift using the first batch, then freezes.
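A minimal sketch of the "buckets from math alone" idea, not turbovec's actual code. It assumes unit-normalized vectors, for which one coordinate of a uniformly random direction in d dimensions has density proportional to (1 - x^2)^((d - 3) / 2) on (-1, 1), a rescaled symmetric Beta. It then runs Lloyd-Max iterations on a numerical grid of that density, so no data is ever seen. The function names, grid size, and iteration count are illustrative choices, and the random rotation and TQ+ calibration are left out.

```python
import math


def coordinate_weight(x: float, d: int) -> float:
    """Unnormalized density of one coordinate of a uniformly random unit
    vector in R^d: proportional to (1 - x^2)^((d - 3) / 2) on (-1, 1)."""
    if abs(x) >= 1.0:
        return 0.0
    # Work in log space so large d underflows to 0.0 instead of failing.
    return math.exp(0.5 * (d - 3) * math.log1p(-x * x))


def lloyd_max_levels(d: int, bits: int, grid_size: int = 20001, iters: int = 100):
    """Reconstruction levels of a 2**bits-level Lloyd-Max quantizer for the
    density above, computed on a grid with no training data."""
    xs = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    ws = [coordinate_weight(x, d) for x in xs]
    k = 1 << bits
    sigma = 1.0 / math.sqrt(d)  # each coordinate has variance 1/d
    # Start from evenly spaced levels within two standard deviations.
    levels = [sigma * (-2.0 + 4.0 * (j + 0.5) / k) for j in range(k)]
    for _ in range(iters):
        # Step 1: bucket boundaries are midpoints between adjacent levels.
        bounds = [(levels[j] + levels[j + 1]) / 2.0 for j in range(k - 1)]
        # Step 2: each level moves to the density-weighted mean of its bucket.
        num = [0.0] * k
        den = [0.0] * k
        j = 0
        for x, w in zip(xs, ws):  # xs is ascending, so j only moves forward
            while j < k - 1 and x > bounds[j]:
                j += 1
            num[j] += w * x
            den[j] += w
        levels = [num[j] / den[j] if den[j] > 0.0 else levels[j] for j in range(k)]
    return levels


def encode(x: float, levels) -> int:
    """Index of the nearest reconstruction level (the stored code)."""
    return min(range(len(levels)), key=lambda j: abs(x - levels[j]))


levels = lloyd_max_levels(d=1536, bits=2)
scaled = [v * math.sqrt(1536) for v in levels]  # in units of standard deviations
print([round(v, 3) for v in scaled])
```

For large d the coordinate distribution is very close to a Gaussian with variance 1/d, so the four 2-bit levels should land near the textbook Gaussian Lloyd-Max values of roughly ±0.45 and ±1.51 standard deviations. The same table applies to every coordinate of every vector, which is why no per-dataset codebook is needed.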
Key highlights
- No train phase. Add vectors, they’re indexed. The index grows incrementally without retraining or rebuilds.
- Filtered search in SIMD. Pass an allowlist of ids; the kernel skips disallowed 32-vector blocks before any lookup-table work, and drops individual slots at heap-insert time. No over-fetching, no recall penalty for selective filters.
- Speed claims with nuance. On ARM (Apple M3 Max), turbovec beats FAISS IndexPQFastScan by 12–20% across all configs. On x86 (Sapphire Rapids), it wins 4-bit by 1–6%, matches 2-bit single-threaded within ~1%, and trails 2-bit multi-threaded by 2–4%. The README is upfront about where and why.
- Recall trade-offs. turbovec beats FAISS by 0.4–3.4 points at R@1 on OpenAI d=1536/d=3072 embeddings; on low-dim GloVe d=200, it trails by 1.2 points at 2-bit R@1 (the Beta assumption is looser there), with the gap closing by k≈16.
- Length-renormalized scoring. Stores one scalar per vector to debias the inner-product estimator at zero query-time cost. The README calls this “a one-shot price paid at ingest, not at query.”
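The filtered-search bullet above describes two levels of skipping: whole 32-vector blocks with no allowed id, then individual slots at heap-insert time. The control flow can be sketched in plain Python as follows. This is an illustration under assumptions, not turbovec's kernel: `score_block` is a hypothetical stand-in for the SIMD lookup-table scoring, ids are assumed to be dense positions 0..n-1, and higher scores are assumed to be better.

```python
import heapq

BLOCK = 32  # vectors per block, matching the block size described above


def filtered_top_k(score_block, n_vectors, allowed_ids, k):
    """Return (top-k hits as (score, id), number of blocks actually scored).

    score_block(b) is a hypothetical callback returning the scores of the
    vectors in block b; it stands in for the real SIMD kernel.
    """
    n_blocks = (n_vectors + BLOCK - 1) // BLOCK
    # One 32-bit mask per block: bit i is set if slot i is on the allowlist.
    masks = [0] * n_blocks
    for vid in allowed_ids:
        if 0 <= vid < n_vectors:
            masks[vid // BLOCK] |= 1 << (vid % BLOCK)

    heap = []  # min-heap holding the best k (score, id) pairs seen so far
    scored_blocks = 0
    for b, mask in enumerate(masks):
        if mask == 0:
            continue  # no allowed id in this block: skip before any scoring
        scored_blocks += 1
        for slot, score in enumerate(score_block(b)):
            if not (mask >> slot) & 1:
                continue  # disallowed slot: dropped at heap-insert time
            vid = b * BLOCK + slot
            if len(heap) < k:
                heapq.heappush(heap, (score, vid))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, vid))
    return sorted(heap, reverse=True), scored_blocks


# Toy data: 100 two-dimensional vectors, exact dot products instead of
# quantized scores, so the score of vector i against the query is simply i.
vectors = [[float(i), 1.0] for i in range(100)]
query = [1.0, 0.0]


def toy_score_block(b):
    block = vectors[b * BLOCK:(b + 1) * BLOCK]
    return [sum(q * x for q, x in zip(query, v)) for v in block]


hits, scored = filtered_top_k(toy_score_block, len(vectors), {3, 70, 71}, k=2)
print(hits, scored)
```

With the allowlist {3, 70, 71}, only blocks 0 and 2 are scored and the two best allowed ids come back, so the filter never has to over-fetch and discard afterwards.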
Caveats
- The 2-bit multi-threaded x86 gap against FAISS is real and acknowledged: on short accumulate loops, the gains from loop unrolling are not enough to beat FAISS’s AVX-512 VBMI path.
- GloVe-style low-dimensional embeddings are the harder regime for this quantizer; the asymptotic Beta assumption is looser there.
Verdict
Worth a look if you’re building RAG on memory-constrained or air-gapped hardware and want a pip-installable, no-training alternative to FAISS. Less compelling if you’re already heavily invested in FAISS GPU paths or need guarantees on low-dim, 2-bit recall at small k.