SqueezeAILab/KVQuant
A quantization methodology for KV cache compression, enabling LLM inference at context lengths of up to 10M tokens on an 8-GPU system.

KVQuant addresses the memory bottleneck in long-context LLM inference by quantizing the KV cache to low precision. It preserves accuracy by exploiting consistent patterns in cached KV values, using three techniques: per-channel, pre-RoPE key quantization to handle outlier channels; non-uniform quantization to better represent non-uniformly distributed activations; and dense-and-sparse quantization to mitigate the impact of numerical outliers. This enables serving LLaMA-7B with a 1M context length on a single A100-80GB GPU, or a 10M context length on an 8-GPU system.
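To make two of these ideas concrete, here is a minimal NumPy sketch on synthetic data. It is not KVQuant's implementation: the function names (`quantize_uniform`, `dense_and_sparse`) are hypothetical, plain uniform quantization stands in for KVQuant's non-uniform datatypes, the pre-RoPE step is not modeled, and the bit width, outlier fraction, and the synthetic "outlier channels" are all assumptions chosen for illustration. It compares per-token against per-channel scaling of a key matrix, then shows how keeping a small fraction of per-channel outliers in full precision tightens the range used for the remaining values.

```python
import numpy as np

def quantize_uniform(x, bits, axis):
    """Quantize then dequantize x, taking min/max statistics along `axis`."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid divide-by-zero
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo

def dense_and_sparse(x, bits, axis, outlier_frac=0.01):
    """Keep values outside per-vector percentile bounds exact; quantize the rest."""
    lo = np.quantile(x, outlier_frac / 2, axis=axis, keepdims=True)
    hi = np.quantile(x, 1 - outlier_frac / 2, axis=axis, keepdims=True)
    outliers = (x < lo) | (x > hi)            # the "sparse" part
    dense = quantize_uniform(np.clip(x, lo, hi), bits, axis)
    return np.where(outliers, x, dense), outliers

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic keys with shape (tokens, channels). Assumption: two channels
# carry consistently large values, imitating outlier channels.
rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64))
keys[:, 3] += 30.0
keys[:, 17] -= 30.0

# axis=1: one range per token (shared across channels).
# axis=0: one range per channel (shared across tokens).
per_token_err = mse(keys, quantize_uniform(keys, bits=4, axis=1))
per_channel_err = mse(keys, quantize_uniform(keys, bits=4, axis=0))

# Add isolated spikes, one per affected channel, to stress per-channel ranges.
spiky = keys.copy()
spiky[np.arange(8) * 10, np.arange(0, 64, 8)] += 100.0
spiky_err = mse(spiky, quantize_uniform(spiky, bits=4, axis=0))
recon, outlier_mask = dense_and_sparse(spiky, bits=4, axis=0)
dense_sparse_err = mse(spiky, recon)

print(f"per-token MSE:            {per_token_err:.4f}")
print(f"per-channel MSE:          {per_channel_err:.4f}")
print(f"per-channel + spikes MSE: {spiky_err:.4f}")
print(f"dense-and-sparse MSE:     {dense_sparse_err:.4f}")
```

On this synthetic input, per-channel scaling gives a lower reconstruction error than per-token scaling because the outlier channels no longer stretch the range shared by the ordinary channels. Likewise, separating a few extreme values per channel keeps a single spike from inflating that channel's quantization step.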