tonbistudio/turboquant-pytorch
A PyTorch reimplementation of TurboQuant, Google's vector quantization algorithm for compressing LLM key-value caches.

Velocity (7d): +14 ★/day · Trend: steady
This repository implements Google’s TurboQuant (ICLR 2026) from scratch in PyTorch. The algorithm compresses LLM key-value caches to reduce their memory footprint during inference. The project includes the original implementation plus an improved V3 version, with tests on NVIDIA GPUs that validate compression ratios and attention fidelity at 2-5x compression with 3-bit and 4-bit quantization.
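As a rough illustration of where compression ratios in this range come from, the arithmetic can be sketched in plain Python. This is not code from the repository and not the TurboQuant algorithm itself: the function names, the 16-bit baseline, and the optional per-group scale overhead are all assumptions chosen for illustration.

```python
def compression_ratio(bits, baseline_bits=16, group_size=None, scale_bits=16):
    """Ratio of baseline storage to quantized storage, per cache element.

    Assumption: the uncompressed cache uses `baseline_bits` per element.
    If `group_size` is given, one `scale_bits`-wide scale is stored per
    group (a hypothetical overhead model, not TurboQuant's actual layout).
    """
    per_element_bits = float(bits)
    if group_size is not None:
        per_element_bits += scale_bits / group_size
    return baseline_bits / per_element_bits


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Size of a key-value cache in bytes for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8


if __name__ == "__main__":
    # Hypothetical model shape, used only to make the numbers concrete.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
    for bits in (16, 4, 3):
        size_mib = kv_cache_bytes(bits_per_element=bits, **shape) / 2**20
        ratio = compression_ratio(bits)
        print(f"{bits:>2}-bit: {size_mib:.0f} MiB, {ratio:.2f}x vs 16-bit")
```

Under these assumptions, the ideal ratio for b-bit storage against a 16-bit baseline is 16/b, and any stored metadata (scales, norms, residual terms) lowers it, which is one plausible reason measured ratios fall below the ideal figure.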