Is turboquant open source?

Yes — 0xSero/turboquant is open source, released under the GPL-3.0 license.

What language is turboquant written in?

0xSero/turboquant is primarily written in Python.

How popular is turboquant?

0xSero/turboquant has 1.6k stars on GitHub.

Where can I find turboquant?

0xSero/turboquant is on GitHub at https://github.com/0xSero/turboquant.

0xSero/turboquant

LLM inference compression with an admirably brutal self-audit

TurboQuant implements near-lossless KV cache quantization for vLLM, then runs an adversarial audit to debunk its own paper's marketing claims.

★ 1.6k stars | Python | Inference · Serving | ML Frameworks

What it does

TurboQuant is a vLLM-integrated implementation of the TurboQuant paper (ICLR 2026), compressing transformer KV caches to 3-bit keys and 2-bit values via random orthogonal rotation, Lloyd-Max quantization, and QJL projection. It bit-packs the results and provides Triton kernels for decode attention, aiming to double context windows or free VRAM for concurrent requests.
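To make the rotate-then-quantize idea concrete, here is a minimal NumPy sketch, not the repo's code. It assumes a dense random rotation obtained from a QR decomposition and a 3-bit Lloyd-Max codebook fitted on the fly to standard normal samples (the repo ships pre-generated codebooks instead). The QJL projection, bit-packing, and Triton kernels are deliberately left out, and every function name below is hypothetical.

```python
import numpy as np

def random_rotation(dim, rng):
    # QR of a Gaussian matrix gives an orthogonal Q; the sign fix on the
    # columns is the usual trick to make the rotation uniformly distributed.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def lloyd_max_codebook(bits, rng, n_samples=100_000, iters=50):
    # Lloyd's algorithm in 1-D on standard normal samples:
    # alternate nearest-centroid assignment and centroid (mean) updates.
    samples = rng.standard_normal(n_samples)
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(boundaries, samples)
        centroids = np.array([samples[idx == k].mean() for k in range(levels)])
    return centroids

def quantize(x, rotation, codebook):
    # Rotating a unit vector and scaling by sqrt(dim) makes its coordinates
    # look roughly standard normal, which is what the codebook was fitted to.
    norm = np.linalg.norm(x)
    z = rotation @ (x / norm) * np.sqrt(x.size)
    boundaries = (codebook[:-1] + codebook[1:]) / 2
    codes = np.searchsorted(boundaries, z).astype(np.uint8)
    return codes, norm

def dequantize(codes, norm, rotation, codebook):
    # Look up reconstruction levels, undo the scaling, rotate back.
    z_hat = codebook[codes] / np.sqrt(codes.size)
    return norm * (rotation.T @ z_hat)
```

A quick way to sanity-check a sketch like this is to quantize a random vector, reconstruct it, and compare the two with cosine similarity, which is the same kind of metric the highlights below quote.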

The interesting bit

The repo ships with an “adversarial audit” script that openly debunks its own paper’s claims: it flags the “5.1x compression” figure as misleading, notes that needle-in-haystack tests are trivial when queries equal keys, and admits that wall-clock speedups at 30k context are within noise. That level of public self-skepticism is rare in ML infrastructure.

Key highlights

Near-lossless 3-bit key compression (cosine similarity 1.0000 measured on GPU); 2-bit values are the quality bottleneck at 0.940 cos_sim, with 4-bit values recommended for sensitive workloads.
On a pure dense transformer, claims 77% KV cache savings (4.4x compression); tested on RTX 3090 and RTX 5090 GPUs.
Modular architecture with pre-generated Lloyd-Max codebooks, flat compressed KV stores, and monkey-patched vLLM 0.18.0 integration.
35 passing tests including 9 paper theorem validations (MSE bounds, unbiasedness, distortion scaling) and 19 modular architecture tests.
Context extension measured at 2.0x on dense models and 1.45x on MoE models where only 40% of layers use full attention.
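The percentages and "x" figures quoted on this page are linked by simple arithmetic, which the following back-of-envelope sketch spells out. It is an illustration, not how the repo computes its numbers: both helper names are hypothetical, and the second one assumes cache memory is spread evenly across layers so that savings scale with the fraction of layers that are compressed.

```python
def compression_ratio(savings_fraction):
    # Saving a fraction s of memory leaves (1 - s), i.e. a 1 / (1 - s) ratio.
    return 1.0 / (1.0 - savings_fraction)

def hybrid_savings(per_layer_savings, compressed_layer_fraction):
    # Assumption: every layer holds an equal share of the cache, so only
    # the compressed share of layers contributes any savings.
    return per_layer_savings * compressed_layer_fraction

dense_ratio = compression_ratio(0.77)          # roughly 4.3x to 4.4x, depending on rounding of 77%
hybrid_fraction = hybrid_savings(0.7727, 0.4)  # close to the 30.9% quoted for hybrid models
hybrid_ratio = compression_ratio(0.309)        # close to the 1.45x context extension
```

Under these assumptions the 30.9% and 1.45x figures for hybrid models are consistent with each other and with compressing 40% of the layers. The measured 2.0x context extension on dense models is lower than the memory-only ratio, so it should be read as an empirical result rather than something this arithmetic predicts.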

Caveats

Only compresses full-attention layers; linear-attention and Mamba states are left untouched, so MoE hybrids see just 30.9% total KV savings rather than the theoretical maximum.
The “hybrid decode” path dequantizes the entire compressed history to float32 on every step for compute, so it saves VRAM but not memory bandwidth during decoding; the fused Triton kernels exist but aren’t wired into this path yet.
Prefill still allocates through vLLM’s paged cache; TurboQuant frees the memory afterward rather than avoiding allocation entirely.
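The VRAM-versus-bandwidth caveat is easier to see in code. The sketch below is a simplified illustration with hypothetical names and placeholder codebook values, not the repo's implementation: it stores 2-bit value codes packed four per byte, then dequantizes the entire history to float32 on each decode step. Per-vector norms and other metadata are ignored.

```python
import numpy as np

# Placeholder reconstruction levels for the four 2-bit codes (hypothetical).
CODEBOOK = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)

def pack_2bit(codes):
    # Pack codes in 0..3 into bytes, four codes per byte.
    # The number of codes must be divisible by 4.
    c = codes.astype(np.uint8).reshape(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed):
    # Inverse of pack_2bit: extract the four 2-bit fields of each byte.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1)

def hybrid_decode_step(attn_weights, packed_values, head_dim):
    # Every step touches the whole history: unpack, look up float32 levels,
    # then take the attention-weighted sum over all cached tokens.
    codes = unpack_2bit(packed_values)
    values = CODEBOOK[codes].reshape(-1, head_dim)  # float32, (tokens, head_dim)
    return attn_weights @ values, values.nbytes
```

In this sketch the packed store is 16 times smaller than the float32 temporary (2 bits versus 32 bits per element), which is the capacity win. The temporary is rebuilt in full on each step, though, so the work per decode step still grows with the full history. A fused kernel that reads packed codes directly would avoid that, which is the gap the caveat describes.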

Verdict

Worth a look if you’re running long-context vLLM inference on dense transformers and can tolerate a modest quality-compression tradeoff. If your model is mostly linear-attention or you need guaranteed decode speedups rather than just capacity gains, the benefits shrink accordingly.
