Squeeze your KV cache until it squeaks: 3-bit keys, 2-bit values
A vLLM-integrated quantizer that trades a smidge of value precision for 2× context length on dense transformers — and openly admits where the math gets fuzzy.

What it does
TurboQuant compresses the KV cache (the memory balloon that grows with every token in LLM inference) down to 3 bits per key element and 2 bits per value element. It plugs into vLLM via a monkey-patch, runs on consumer GPUs (tested on RTX 3090s and a 5090), and claims near-lossless key compression with a straightforward tradeoff: values get lossier.
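To put rough numbers on that balloon, here is a back-of-the-envelope sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical stand-in, not a configuration from the project, and the function counts quantized payload only: no norms, codebooks, or paging overhead.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, key_bits=16, value_bits=16):
    """Payload bytes needed to cache keys and values for `tokens` positions."""
    # Number of scalar elements in the key cache (the value cache has the same count).
    elems = tokens * layers * kv_heads * head_dim
    return elems * (key_bits + value_bits) / 8

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
cfg = dict(layers=32, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(32_768, **cfg)
quant = kv_cache_bytes(32_768, key_bits=3, value_bits=2, **cfg)

print(f"fp16 cache:      {fp16 / 2**30:.2f} GiB")
print(f"3-bit K / 2-bit V: {quant / 2**30:.2f} GiB")
print(f"payload-only ratio: {fp16 / quant:.1f}x")
```

The payload-only ratio is a ceiling. Measured figures will sit below it once any per-vector metadata and allocator overhead are counted.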
The interesting bit
The author ships an adversarial audit of their own paper’s claims. “5.1× compression”? Misleading, they say — honest figure is ~4.6×. “Needle-in-haystack passes”? True but trivial. This kind of self-skepticism is rarer than the quantization itself. The method also uses random orthogonal rotation plus Lloyd-Max quantization on a Beta distribution, which sounds fancy but boils down to: rotate, quantize, pack bits, pray the sign bits survive.
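The "rotate, quantize, pack bits" part can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the project's code: every function name here is hypothetical, the Lloyd-Max codebook is fitted to empirical samples rather than derived from the Beta distribution, random Gaussian vectors stand in for real keys and values, and whatever the project does with sign bits is left out.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix keeps the rotation uniformly distributed

def lloyd_max(samples, bits, iters=50):
    """Fit a 1-D Lloyd-Max codebook with 2**bits levels to empirical samples."""
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2      # decision boundaries
        idx = np.searchsorted(edges, samples)       # nearest-level assignment
        for k in range(levels.size):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()          # centroid update
    return levels

def quantize(x, rot, levels):
    """Normalize each row, rotate it, and map every coordinate to a level index."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    edges = (levels[:-1] + levels[1:]) / 2
    codes = np.searchsorted(edges, (x / norms) @ rot).astype(np.uint8)
    return codes, norms

def dequantize(codes, norms, rot, levels):
    """Look up levels, undo the rotation, and restore the stored norms."""
    return (levels[codes] @ rot.T) * norms

def pack_codes(codes, bits):
    """Pack small integer codes into a dense byte array, `bits` bits per code."""
    planes = (codes[..., None] >> np.arange(bits)) & 1
    return np.packbits(planes.astype(np.uint8).reshape(-1))

def unpack_codes(packed, bits, count):
    """Inverse of pack_codes for `count` codes."""
    planes = np.unpackbits(packed)[: count * bits].reshape(count, bits)
    return (planes << np.arange(bits)).sum(axis=1).astype(np.uint8)

rng = np.random.default_rng(0)
dim = 128
rot = random_rotation(dim, rng)
x = rng.standard_normal((512, dim))  # stand-in for cached vectors
train = ((x / np.linalg.norm(x, axis=-1, keepdims=True)) @ rot).ravel()

cos = {}
for bits in (2, 3, 4):
    levels = lloyd_max(train, bits)
    codes, norms = quantize(x, rot, levels)
    packed = pack_codes(codes, bits)
    restored = unpack_codes(packed, bits, codes.size).reshape(codes.shape)
    x_hat = dequantize(restored, norms, rot, levels)
    sims = np.sum(x * x_hat, axis=-1) / (
        np.linalg.norm(x, axis=-1) * np.linalg.norm(x_hat, axis=-1)
    )
    cos[bits] = float(sims.mean())
    print(f"{bits}-bit: {packed.nbytes} bytes packed, mean cos_sim {cos[bits]:.3f}")
```

The rotation is what makes a single scalar codebook reasonable: after it, every coordinate of a unit vector follows the same distribution, so one set of levels serves the whole vector. Even in this toy setup, cosine similarity climbs with bit width, which is the same tradeoff the project reports for values.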
Key highlights
- Dense transformers win big: 77% KV savings (4.4× compression) on pure full-attention models; 2× max token capacity measured on Qwen3.5-27B
- Hybrid-attention models shrug: only 30.9% savings on the MoE model Qwen3.5-35B-A3B, because 30 of its 40 layers use linear attention, which TQ can’t touch
- Value quantization is the weak link: 2-bit values hit cos_sim 0.940; 4-bit values recover to 0.997
- 35 tests pass, including 9 theorem validations and 7 core quantizer tests
- Triton kernels exist but the hybrid decode path still dequantizes everything to float32 — storage savings yes, compute savings not yet
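The percentages and multipliers above are two views of the same quantity, and converting between them makes the numbers easier to compare. A small sketch follows; the 16-bit baseline used for the "effective bits" figure is an assumption, since the write-up does not state the uncompressed dtype.

```python
def savings_to_ratio(savings):
    """Convert fractional memory savings (0.77 = 77%) into a compression ratio."""
    return 1.0 / (1.0 - savings)

def effective_bits(ratio, baseline_bits=16):
    """Average stored bits per element implied by a ratio (baseline is assumed)."""
    return baseline_bits / ratio

for label, savings in [("dense, full attention", 0.77), ("hybrid, Qwen3.5-35B-A3B", 0.309)]:
    print(f"{label}: {savings:.1%} saved -> {savings_to_ratio(savings):.2f}x")

for ratio in (4.4, 4.6):
    print(f"{ratio}x -> {effective_bits(ratio):.2f} effective bits per element")
```

A savings figure that rounds to 77% corresponds to a ratio between roughly 4.3× and 4.4×, so the two dense-model numbers agree. The effective bit count lands above the nominal 2.5-bit average of 3-bit keys and 2-bit values, which is the kind of gap that separates a headline ratio from an honest one.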
Caveats
- Prefill still allocates a full paged cache; TQ only compresses after the fact
- The “30k context is faster” claim is within noise: wall-clock time was actually slower, and the comparison rests on single runs (N=1)
- 200k context “works” but output quality was never checked
- PCIe-only interconnect on the 8×3090 test rig — no NVLink
Verdict
Worth a look if you’re running long-context inference on dense transformers with vLLM and VRAM is the wall. Skip it (for now) if your model mixes linear attention, or if you need guaranteed compute savings rather than just memory headroom.