HKUDS/SepLLM
SepLLM is a sparse attention method that accelerates LLM inference by condensing the information in each token segment into its separator token, reducing KV cache usage by over 50% with minimal performance loss.

SepLLM builds on the observation that separator tokens (such as punctuation) receive a disproportionately large share of attention compared to semantically meaningful tokens. The framework exploits this pattern by condensing the information in each segment between separators into the separator token itself, so the remaining tokens in the segment need not be retained and redundant computation during inference is avoided. It implements efficient kernels for training and inference, supports training-free, training-from-scratch, and post-training settings, and demonstrates effectiveness across multiple benchmarks using the Llama-3-8B backbone.
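The idea of retaining separator tokens while dropping the rest of each segment can be illustrated with a small, self-contained sketch. This is not the repository's implementation or API: the names `SEPARATORS`, `kept_positions`, `sparse_causal_mask`, `n_initial`, and `n_local` are hypothetical, tokens are plain strings rather than model token IDs, and the choice to also keep a few initial tokens and a short window of recent tokens is an assumption made so that the toy example stays usable.

```python
from typing import List, Set

# Assumption: a hand-picked separator set; a real tokenizer would use token IDs.
SEPARATORS: Set[str] = {".", ",", ";", ":", "!", "?", "\n"}


def kept_positions(tokens: List[str], n_initial: int = 1, n_local: int = 2) -> List[int]:
    """Indices whose key/value entries would remain cached after the last token.

    A position is kept if it is one of the first `n_initial` tokens, a separator,
    or one of the last `n_local` tokens (all three rules are illustrative).
    """
    n = len(tokens)
    keep = []
    for i, tok in enumerate(tokens):
        is_initial = i < n_initial
        is_separator = tok in SEPARATORS
        is_local = i >= n - n_local
        if is_initial or is_separator or is_local:
            keep.append(i)
    return keep


def sparse_causal_mask(tokens: List[str], n_initial: int = 1, n_local: int = 2) -> List[List[bool]]:
    """Boolean mask where mask[q][k] is True if query q may attend to key k."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query only
            mask[q][k] = (
                k < n_initial                 # initial tokens
                or tokens[k] in SEPARATORS    # separator tokens
                or q - k < n_local            # recent neighbours, including q itself
            )
    return mask


tokens = ["The", "cat", "sat", ",", "then", "it", "slept", "."]
kept = kept_positions(tokens)
print(kept)                     # positions retained in the toy cache
print(len(kept) / len(tokens))  # fraction of the cache retained
```

In this toy sentence, only the first token, the two punctuation marks, and the most recent tokens stay in the cache. The saving on real text depends on how often separators occur and on the window sizes chosen, so the sketch says nothing about the figures reported for SepLLM itself.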