Zefan-Cai/KVCache-Factory
A unified framework that implements multiple KV cache compression methods to accelerate inference for autoregressive language models.

KVCache-Factory implements several KV cache compression techniques, including PyramidKV, SnapKV, H2O, and StreamingLLM. The project targets inference optimization for large language models by reducing the memory footprint and computational overhead of autoregressive generation. It supports popular models such as LLaMA and Mistral, offers multi-GPU inference, and integrates with Flash Attention v2 to improve performance.
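
To make the idea of KV cache compression concrete, here is a minimal sketch of a StreamingLLM-style eviction policy, which retains a few initial "attention sink" tokens plus a sliding window of the most recent tokens. This is an illustration only and not the repository's code: the function names `streaming_keep_indices` and `compress_cache`, the parameters `num_sink` and `window`, and the list-based cache layout are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def streaming_keep_indices(seq_len: int, num_sink: int, window: int) -> List[int]:
    """Return the cache positions to retain (hypothetical helper).

    Keeps the first `num_sink` positions (attention sinks) and the last
    `window` positions (recent tokens); everything in between is evicted.
    """
    if num_sink < 0 or window < 0:
        raise ValueError("num_sink and window must be non-negative")
    if seq_len <= num_sink + window:
        # The cache still fits within the budget, so nothing is evicted.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


def compress_cache(
    keys: Sequence, values: Sequence, num_sink: int, window: int
) -> Tuple[list, list]:
    """Apply the same keep-set to per-token keys and values (hypothetical layout)."""
    keep = streaming_keep_indices(len(keys), num_sink, window)
    return [keys[i] for i in keep], [values[i] for i in keep]


# Toy cache with one entry per token position.
keys = [f"k{i}" for i in range(10)]
values = [f"v{i}" for i in range(10)]
small_keys, small_values = compress_cache(keys, values, num_sink=2, window=3)
print(small_keys)  # ['k0', 'k1', 'k7', 'k8', 'k9']
```

With this policy the cache size is bounded by `num_sink + window` regardless of how long generation runs. Score-based methods such as H2O and SnapKV instead choose which positions to keep using attention statistics, which this sketch does not attempt to model.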