Zefan-Cai/R-KV
A NeurIPS 2025 paper presenting redundancy-aware KV cache compression to serve reasoning models with reduced memory while preserving full accuracy.

Velocity (7d): +3.2 ★/day · Trend: steady
R-KV compresses the key-value (KV) cache by discarding redundant tokens during LLM decoding, reducing memory use for reasoning models. It integrates with popular inference stacks, including vLLM, SGLang, and FlashAttention. Evaluation focuses on math and reasoning benchmarks such as AIME24, with support for models like DeepSeek-R1-Distill-Llama-8B.
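To make the idea of redundancy-aware eviction concrete, here is a minimal sketch of how a cache might decide which tokens to keep under a fixed budget. It is an illustration only, not the repository's implementation: the function names `cosine` and `select_tokens`, the `importance` input, the weight `lam`, and the scoring formula itself are all assumptions introduced for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_tokens(keys, importance, budget, lam=0.5):
    """Return the sorted indices of cached tokens to keep.

    keys:       one key vector per cached token
    importance: per-token importance score (hypothetical input,
                e.g. accumulated attention mass)
    budget:     number of tokens the cache may retain
    lam:        assumed trade-off between importance and redundancy
    """
    n = len(keys)
    if budget >= n:
        return list(range(n))
    scores = []
    for i in range(n):
        # Redundancy: highest similarity to any other cached key.
        redundancy = max(
            (cosine(keys[i], keys[j]) for j in range(n) if j != i),
            default=0.0,
        )
        # Assumed scoring rule: reward importance, penalize redundancy.
        scores.append(lam * importance[i] - (1.0 - lam) * redundancy)
    keep = sorted(range(n), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(keep)

# Tokens 0 and 1 share an identical key; token 2 is distinct.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
importance = [0.5, 0.4, 0.3]
print(select_tokens(keys, importance, budget=2))  # [0, 2]
```

In this toy case the two duplicate keys compete for one slot and the more important one survives, while the distinct token is kept despite its lower importance. The pairwise similarity loop is quadratic in cache length, so it serves only to show the selection logic, not a practical cost profile.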