
NVIDIA/kvpress

A Python library implementing multiple KV cache compression methods to reduce LLM inference memory usage.

Star history (7-day velocity): +1.9 ★/day; trend: steady

kvpress provides “presses”: methods that compress the key-value (KV) cache during the prefilling phase of LLM inference. It wraps Hugging Face Transformers with these compression strategies, so long-context models such as Llama 3.1-70B can be deployed with a smaller KV cache and correspondingly lower memory requirements. The project also includes benchmarking tools and integrates with existing Transformers pipelines.
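To give a feel for the memory at stake, here is a minimal back-of-the-envelope sketch in plain Python. It does not use kvpress itself. The names `ModelShape`, `kv_cache_bytes`, and `prune_by_score` are hypothetical helpers, the shape values (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) are assumed for a Llama 3.1-70B-class model, and `compression_ratio` is assumed to mean the fraction of cached tokens removed. The pruning function is a toy stand-in for a score-based press, not the library's algorithm.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ModelShape:
    """Attention shape needed to size a KV cache (illustrative, assumed values)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 / bf16


def kv_cache_bytes(shape: ModelShape, num_tokens: int,
                   compression_ratio: float = 0.0) -> int:
    """Bytes used by keys and values for one sequence after pruning.

    compression_ratio is the assumed fraction of cached tokens removed
    (0.0 keeps everything).
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    kept_tokens = int(num_tokens * (1.0 - compression_ratio))
    # Factor 2: one key vector and one value vector per token, head, and layer.
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_value)
    return kept_tokens * per_token


def prune_by_score(scores: Sequence[float], compression_ratio: float) -> List[int]:
    """Toy press: keep the highest-scoring positions, returned in original order."""
    n_keep = int(len(scores) * (1.0 - compression_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)
    gib = 2 ** 30
    print(kv_cache_bytes(shape, 131072) / gib)       # 40.0 (full cache, 128k tokens)
    print(kv_cache_bytes(shape, 131072, 0.5) / gib)  # 20.0 (half the tokens removed)
    print(prune_by_score([0.1, 0.9, 0.3, 0.7], 0.5)) # [1, 3]
```

Under these assumptions, a single 128k-token sequence needs about 40 GiB for its KV cache, which is why removing even half of the cached tokens during prefilling matters for deployment.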
