jy-yuan/KIVI

A tuning-free asymmetric 2-bit quantization method for KV cache to accelerate and compress LLM inference.

KIVI
Velocity (7d): +0.5 ★ / day
Trend: steady

KIVI is a quantization technique for the key-value (KV) cache in large language models. It quantizes the cache to an asymmetric 2-bit representation without any tuning or calibration, which reduces the memory footprint and accelerates inference. The method is designed to work with popular LLM architectures, including Llama and Mistral, and it supports group-based attention for efficiency. It has been adopted into HuggingFace Transformers and integrated with flash attention for faster prefilling.
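
To make "asymmetric 2-bit" concrete, here is a minimal NumPy sketch of asymmetric uniform quantization, in which each group of values is mapped to one of four integer levels using its own scale and zero point. It is an illustration under stated assumptions, not the repository's implementation: the function names `quantize_asym` and `dequantize_asym` are hypothetical, the tensor shape is made up, and the sketch leaves out bit packing and any grouping or kernel details the actual code uses. The choice of axis follows the KIVI paper's description of quantizing keys per channel and values per token.

```python
import numpy as np

def quantize_asym(x, n_bits=2, axis=-1):
    """Asymmetric uniform quantization; min/max are taken along `axis`.

    Hypothetical helper for illustration; returns integer codes plus the
    scale and zero point needed to reconstruct x ~= q * scale + zero.
    """
    levels = 2 ** n_bits - 1                      # 3 for 2-bit, so codes are 0..3
    x_min = x.min(axis=axis, keepdims=True)       # zero point (asymmetric offset)
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero on constant slices
    q = np.clip(np.round((x - x_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, x_min

def dequantize_asym(q, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return q.astype(np.float32) * scale + zero

# Toy "key cache" of shape (tokens, channels); the shape is an assumption.
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 4)).astype(np.float32)

# Per-channel quantization: statistics are computed across tokens (axis=0),
# so every channel gets its own scale and zero point.
q, scale, zero = quantize_asym(keys, n_bits=2, axis=0)
recon = dequantize_asym(q, scale, zero)
max_err = float(np.abs(keys - recon).max())
```

Because rounding goes to the nearest level, each reconstructed element differs from the original by at most half a quantization step (`scale / 2`) for its channel. Passing `axis=1` to the same helper would give per-token quantization instead.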