
SqueezeAILab/KVQuant

A quantization methodology for KV cache compression, enabling LLM inference with up to 1M context length on a single A100-80GB GPU and up to 10M on an 8-GPU system.


KVQuant addresses the memory bottleneck in long-context LLM inference by quantizing the KV cache to low precision. It preserves accuracy by exploiting consistent patterns in cached KV values through three techniques: per-channel quantization of keys before RoPE is applied, which handles outlier channels; non-uniform quantization, which fits asymmetric activation distributions; and dense-and-sparse quantization, which stores numerical outliers separately so they do not stretch the quantization range. This enables serving LLaMA-7B with 1M context on a single A100-80GB or 10M context on an 8-GPU system.
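To make two of the ideas above concrete, here is a minimal NumPy sketch of per-channel quantization combined with a dense-and-sparse split. It is an illustration under stated assumptions, not the KVQuant implementation: the function names `quantize_keys` and `dequantize_keys` are hypothetical, outliers are picked with simple per-channel percentile thresholds, and the dense part uses a uniform grid rather than the non-uniform quantization the project describes. RoPE is not modeled at all; the input is simply assumed to be a `(tokens, channels)` array of key activations.

```python
import numpy as np

def quantize_keys(keys, bits=4, outlier_pct=1.0):
    """Toy per-channel, dense-and-sparse quantizer (hypothetical helper).

    keys: (tokens, channels) array. Each channel (column) gets its own
    scale and offset, so one large-magnitude channel does not degrade
    the resolution of the others. Assumes bits <= 8.
    """
    keys = np.asarray(keys, dtype=np.float32)
    levels = 2 ** bits - 1

    # Per-channel range taken from percentiles; values outside it are
    # treated as outliers and kept in full precision (the "sparse" part).
    lo, hi = np.percentile(
        keys, [outlier_pct / 2, 100 - outlier_pct / 2], axis=0
    )
    mask = (keys < lo) | (keys > hi)

    # Uniform grid per channel for the remaining "dense" values.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((keys - lo) / scale), 0, levels).astype(np.uint8)

    sparse = np.where(mask, keys, 0.0)
    return codes, scale, lo, mask, sparse

def dequantize_keys(codes, scale, lo, mask, sparse):
    """Rebuild keys: dequantized dense values, exact values for outliers."""
    dense = codes * scale + lo
    return np.where(mask, sparse, dense)
```

In this sketch, non-outlier entries are reconstructed to within half a quantization step of their channel, while outlier entries are restored exactly. A real implementation would store the sparse part in a compressed sparse format and pack the low-bit codes, neither of which is shown here.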
