NVIDIA/kvpress
A Python library implementing multiple KV cache compression methods to reduce LLM inference memory usage.

Velocity (7d): +1.9 ★/day · Trend: steady
kvpress provides “presses”: compression strategies that shrink the key-value (KV) cache during the prefilling phase of LLM inference. It builds on Hugging Face Transformers, so long-context models such as Llama 3.1-70B can be deployed with a smaller memory footprint. The project also includes benchmarking tools and integrates with existing Transformers pipelines.
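To make the idea of compressing a KV cache concrete, here is a minimal, library-free sketch. It is not kvpress's API: the function name `compress_kv`, the `compression_ratio` parameter, and the scoring rule (keep the positions whose key vectors have the smallest L2 norm) are assumptions chosen for illustration. A real press would operate on per-layer, per-head tensors and may score positions very differently.

```python
import math


def compress_kv(keys, values, compression_ratio):
    """Toy KV cache pruning (hypothetical helper, not part of kvpress).

    keys, values: lists of per-position vectors (lists of floats).
    compression_ratio: fraction of cached positions to drop, in [0, 1).
    Returns the kept keys, kept values, and the kept position indices.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")

    n_keep = int(len(keys) * (1.0 - compression_ratio))

    # Assumed scoring heuristic: a position's score is the L2 norm of its key.
    norms = [math.sqrt(sum(x * x for x in k)) for k in keys]

    # Pick the n_keep positions with the smallest norm, then restore
    # the original token order so positional structure is preserved.
    by_score = sorted(range(len(keys)), key=lambda i: norms[i])
    kept = sorted(by_score[:n_keep])

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Usage: drop half of a four-position cache.
keys = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]
small_keys, small_values, kept = compress_kv(keys, values, compression_ratio=0.5)
```

Because the pruning happens once, after the prompt has been processed, every later decoding step attends over a shorter cache, which is where the memory saving comes from.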