microsoft/MInference
MInference is a set of dynamic sparse attention kernels that speeds up the pre-fill stage of long-context LLM inference by up to 10x on an A100 GPU.

Velocity (7d): +1.6 ★/day · Trend: steady
MInference implements dynamic sparse attention to speed up the pre-fill phase of long-context language model inference. Instead of computing full attention, it approximates it by evaluating only a sparse subset of query-key pairs, which reduces the cost of pre-fill while maintaining accuracy. The project provides optimized kernels that work with inference frameworks such as SGLang and vLLM, with the largest speedups at million-token context lengths.
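
To make the idea of sparse attention concrete, here is a minimal pure-Python sketch. It is not MInference's code or API: the real project uses GPU kernels and chooses sparse patterns dynamically, whereas this toy uses one fixed pattern (a few leading "sink" tokens plus a local window) purely for illustration. All function names and parameters (`allowed_positions`, `n_sink`, `window`, and so on) are hypothetical.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, K, V, allowed):
    # Attention output for one query, restricted to key positions in `allowed`.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in allowed]
    weights = softmax(scores)
    return [sum(w * V[j][d] for w, j in zip(weights, allowed))
            for d in range(len(V[0]))]

def allowed_positions(i, n_sink, window):
    # Assumed fixed pattern: the first `n_sink` tokens plus the last `window`
    # tokens up to and including position i (causal, so never beyond i).
    sinks = set(range(min(n_sink, i + 1)))
    local = set(range(max(0, i - window + 1), i + 1))
    return sorted(sinks | local)

def dense_causal(Q, K, V):
    # Reference: every query attends to all earlier positions and itself.
    return [attend(Q[i], K, V, list(range(i + 1))) for i in range(len(Q))]

def sparse_causal(Q, K, V, n_sink=2, window=4):
    # Approximation: every query attends only to its allowed positions.
    return [attend(Q[i], K, V, allowed_positions(i, n_sink, window))
            for i in range(len(Q))]

def random_matrix(n, d, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 16, 8
    Q, K, V = (random_matrix(n, d, rng) for _ in range(3))
    dense = dense_causal(Q, K, V)
    sparse = sparse_causal(Q, K, V)
    dense_pairs = n * (n + 1) // 2
    sparse_pairs = sum(len(allowed_positions(i, 2, 4)) for i in range(n))
    max_err = max(abs(a - b) for r1, r2 in zip(dense, sparse)
                  for a, b in zip(r1, r2))
    print("query-key pairs:", dense_pairs, "dense vs", sparse_pairs, "sparse")
    print("max abs deviation from dense output:", max_err)
```

The number of query-key pairs grows quadratically with sequence length in the dense version but only linearly in this sparse one, which is why the savings matter most at very long contexts. With random inputs like these the deviation from dense attention can be noticeable; sparse methods depend on real attention maps concentrating their weight on a small set of positions.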