FMInference/H2O
A KV cache eviction policy for LLM inference that dynamically retains heavy-hitter tokens to improve throughput while maintaining accuracy.

H2O (Heavy-Hitter Oracle) is a research implementation from NeurIPS 2023 that optimizes LLM inference by dynamically managing the KV cache. It observes that a small subset of tokens contributes most to attention scores, and it proposes an eviction policy that keeps a balance of recent tokens and these high-value heavy-hitter tokens. The approach reduces the KV cache memory footprint and improves throughput while maintaining model accuracy across OPT, LLaMA, and GPT-NeoX architectures.
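
The eviction idea described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function name `select_kept_positions` and the parameters `heavy_budget` and `recent_budget` are hypothetical, and it assumes each cached token already has an accumulated attention score (the sum of the attention weights it has received so far).

```python
from typing import List, Sequence


def select_kept_positions(
    acc_scores: Sequence[float],
    heavy_budget: int,
    recent_budget: int,
) -> List[int]:
    """Return the sorted cache positions to keep after one eviction pass.

    acc_scores[i] is the accumulated attention score of cached token i
    (an assumed input; how it is accumulated is up to the caller).
    """
    n = len(acc_scores)
    # Nothing to evict while the cache fits within the combined budget.
    if n <= heavy_budget + recent_budget:
        return list(range(n))

    # Always keep the most recent tokens.
    recent_start = n - recent_budget
    recent = list(range(recent_start, n))

    # Among the older tokens, keep those with the highest accumulated scores.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:heavy_budget]

    return sorted(heavy) + recent


# Six cached tokens, budget of two heavy hitters plus two recent tokens.
scores = [0.9, 0.1, 0.7, 0.05, 0.2, 0.3]
print(select_kept_positions(scores, heavy_budget=2, recent_budget=2))
# [0, 2, 4, 5]
```

Positions 4 and 5 survive because they are the most recent, and positions 0 and 2 survive because they have the highest accumulated scores among the older tokens. In a real decoder this selection would run per attention head at each generation step, with the scores updated from the newest attention row before evicting.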