jy-yuan/KIVI
A tuning-free asymmetric 2-bit quantization method for KV cache to accelerate and compress LLM inference.

KIVI is a quantization technique for the KV cache of large language models that reduces memory footprint and accelerates inference. It compresses the cache by quantizing it to a 2-bit asymmetric representation, with no tuning required. The method works with popular LLM architectures, including Llama and Mistral, and supports grouped-query attention (GQA). It has been adopted into HuggingFace Transformers and integrates with FlashAttention for faster prefilling.
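
To make "2-bit asymmetric quantization" concrete, here is a minimal NumPy sketch. It is not the repository's implementation: the function names `quantize_asym` and `dequantize_asym` are hypothetical, and the choice of axes (statistics per channel for keys, per token for values) follows the scheme described in the KIVI paper as best understood here. The real kernels also pack the 2-bit codes, use fixed group sizes, and keep a window of recent tokens in full precision, all of which this sketch omits.

```python
import numpy as np

def quantize_asym(x, bits=2, axis=-1):
    """Asymmetric min/max quantization along `axis` (illustrative helper)."""
    levels = 2 ** bits - 1                      # 3 for 2-bit, so codes are 0..3
    x_min = x.min(axis=axis, keepdims=True)     # zero point: the group minimum
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard against constant groups
    q = np.clip(np.round((x - x_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, x_min

def dequantize_asym(q, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return q.astype(np.float32) * scale + zero

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 4)).astype(np.float32)    # (tokens, channels)
values = rng.standard_normal((8, 4)).astype(np.float32)  # (tokens, channels)

# Assumption: keys share statistics per channel (reduce over tokens, axis=0),
# values share statistics per token (reduce over channels, axis=1).
k_q, k_scale, k_zero = quantize_asym(keys, axis=0)
v_q, v_scale, v_zero = quantize_asym(values, axis=1)

k_hat = dequantize_asym(k_q, k_scale, k_zero)
v_hat = dequantize_asym(v_q, v_scale, v_zero)
```

Because the scale and zero point come directly from each group's minimum and maximum, nothing has to be learned or calibrated, which is what "tuning-free" refers to. The reconstruction error of each element is bounded by half the scale of its group.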