← all repositories

tonbistudio/turboquant-pytorch

A PyTorch reimplementation of TurboQuant, Google's vector quantization algorithm for compressing LLM key-value caches.

turboquant-pytorch
Velocity · 7d
+14
★ / day
Trend
steady
star history

This repository implements Google’s TurboQuant (ICLR 2026) from scratch in PyTorch. The algorithm compresses LLM key-value caches to reduce memory footprint during inference. The project includes original implementation plus an improved V3 version, with tests on NVIDIA GPUs validating compression ratios and attention fidelity at 2-5x compression rates with 3-bit and 4-bit quantization.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.