
Zefan-Cai/R-KV

A NeurIPS 2025 paper presenting redundancy-aware KV cache compression to serve reasoning models with reduced memory while preserving full accuracy.

R-KV
Velocity (7d): +3.2 ★ / day
Trend: steady

R-KV compresses the key-value (KV) cache during LLM decoding by discarding entries for repetitive tokens, reducing memory use for reasoning models. It integrates with popular inference stacks, including vLLM, SGLang, and FlashAttention. Evaluation centers on math and reasoning benchmarks such as AIME24, with support for models like DeepSeek-R1-Distill-Llama-8B.
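To make the idea of discarding repetitive tokens concrete, here is a minimal sketch in plain Python. It is not the R-KV implementation or its API: the function names (`cosine`, `select_non_redundant`, `compress_kv`), the cosine-similarity test, the greedy newest-first scan, and the `threshold` value are all assumptions chosen for illustration. It only shows how a fixed cache budget might be spent on dissimilar entries rather than on near-duplicates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two key vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_non_redundant(keys, budget, threshold=0.95):
    """Pick up to `budget` cache positions, preferring dissimilar keys.

    Hypothetical policy: scan from the newest token to the oldest and keep a
    position only if its key is not a near-duplicate (cosine >= threshold)
    of a key that was already kept.
    """
    kept, skipped = [], []
    for i in range(len(keys) - 1, -1, -1):
        if len(kept) == budget:
            break
        if any(cosine(keys[i], keys[j]) >= threshold for j in kept):
            skipped.append(i)  # redundant with something already kept
        else:
            kept.append(i)
    # If budget remains, fall back to the skipped positions, newest first.
    for i in skipped:
        if len(kept) == budget:
            break
        kept.append(i)
    return sorted(kept)


def compress_kv(keys, values, budget, threshold=0.95):
    """Return the compressed keys, values, and the retained positions."""
    idx = select_non_redundant(keys, budget, threshold)
    return [keys[i] for i in idx], [values[i] for i in idx], idx


# Positions 0, 2, 3, and 4 carry the same key; position 1 is distinct.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = ["v0", "v1", "v2", "v3", "v4"]
small_keys, small_values, kept_idx = compress_kv(keys, values, budget=2)
print(kept_idx, small_values)
```

With a budget of two, a plain recency window would retain positions 3 and 4, which hold identical keys. The sketch instead retains the newest token and the one distinct older token. A real system would apply a selection like this per attention head on tensors, and would likely weigh other signals alongside similarity.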
