
FMInference/H2O

A KV cache eviction policy for LLM inference that dynamically retains heavy-hitter tokens to improve throughput while maintaining accuracy.

Star velocity (7d): +0.5 ★ / day · Trend: steady

H2O is a research implementation of a NeurIPS 2023 paper that optimizes LLM inference by dynamically managing the KV cache. It builds on the observation that a small subset of tokens, the heavy hitters, accounts for most of the attention score mass, and proposes an eviction policy that keeps a balance of the most recent tokens and these heavy-hitter tokens while discarding the rest. The approach reduces the KV cache memory footprint and improves throughput while maintaining model accuracy across OPT, LLaMA, and GPT-NeoX architectures.
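The eviction policy described above can be illustrated with a minimal sketch. This is not the repository's code: the function name `select_kept_tokens`, the parameters `num_heavy` and `num_recent`, and the use of a plain list of accumulated attention scores are all assumptions made for illustration. It only shows the general idea of keeping a recent window plus the highest-scoring older tokens.

```python
from typing import List, Sequence


def select_kept_tokens(
    acc_scores: Sequence[float], num_heavy: int, num_recent: int
) -> List[int]:
    """Return the sorted cache positions to keep (hypothetical helper).

    acc_scores[i] is assumed to be the attention score that token i has
    accumulated over the decoding steps so far.
    """
    if num_heavy < 0 or num_recent < 0:
        raise ValueError("budgets must be non-negative")

    n = len(acc_scores)
    # Cache still fits in the budget: nothing to evict.
    if n <= num_heavy + num_recent:
        return list(range(n))

    # The last num_recent positions are always kept.
    recent_start = n - num_recent

    # Among older positions, keep the num_heavy highest accumulated scores.
    # Ties are broken in favor of the earlier position.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: (-acc_scores[i], i))[:num_heavy]

    return sorted(heavy) + list(range(recent_start, n))


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.1, 0.4]
    # Budget: 2 heavy hitters + 3 recent tokens out of 8 cached tokens.
    print(select_kept_tokens(scores, num_heavy=2, num_recent=3))
    # prints [0, 3, 5, 6, 7]
```

In this sketch, positions 5 to 7 survive because they are recent, and positions 0 and 3 survive because they carry the largest accumulated scores among the older tokens. A real implementation would apply such a selection per attention head and per layer on tensors, and would update the accumulated scores at every decoding step.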
