
microsoft/MInference

MInference is a sparse attention kernel that speeds up the pre-fill stage of long-context LLM inference by up to 10x on A100 GPUs.

Velocity (7d): +1.6 ★ / day · Trend: steady

MInference implements dynamic sparse attention to speed up the pre-fill phase of long-context language model inference. Instead of computing the full attention matrix, it approximates it by evaluating only the entries estimated to matter most, which reduces computation and memory-bandwidth pressure while maintaining accuracy. The project provides optimized kernels compatible with major inference frameworks such as SGLang and vLLM, and the speedups are largest at very long context lengths, up to the million-token range.
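To make the idea of dynamic sparse attention concrete, here is a minimal NumPy sketch. It is a toy illustration, not MInference's kernels or its pattern-selection method. The function names, the `block` and `keep` parameters, and the mean-pooling importance estimate are all assumptions made for this example. Each query block cheaply scores the earlier key blocks, keeps the few that look most important plus its own diagonal block, and computes exact causal attention only over those.

```python
import numpy as np


def dense_causal_attention(q, k, v):
    """Reference: full causal softmax attention, O(n^2) score entries."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def block_sparse_causal_attention(q, k, v, block=16, keep=4):
    """Toy dynamic block-sparse attention (hypothetical helper).

    For each query block, keep the diagonal block plus the (keep - 1)
    earlier key blocks with the highest estimated importance.
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block

    # Cheap importance estimate (an assumption of this sketch):
    # dot products between mean-pooled query and key blocks.
    q_pool = q.reshape(nb, block, d).mean(axis=1)
    k_pool = k.reshape(nb, block, d).mean(axis=1)
    est = q_pool @ k_pool.T  # shape (nb, nb)

    out = np.zeros_like(v)
    for i in range(nb):
        earlier = np.arange(i)  # key blocks strictly before block i
        order = np.argsort(est[i, earlier])[::-1]
        top = earlier[order[: max(keep - 1, 0)]]
        chosen = np.sort(np.append(top, i))  # always include the diagonal

        rows = np.arange(i * block, (i + 1) * block)
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in chosen]
        )

        # Exact attention restricted to the selected key/value columns.
        s = q[rows] @ k[cols].T / np.sqrt(d)
        s = np.where(cols[None, :] <= rows[:, None], s, -np.inf)
        s -= s.max(axis=-1, keepdims=True)
        w = np.exp(s)
        w /= w.sum(axis=-1, keepdims=True)
        out[rows] = w @ v[cols]
    return out


rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 64, 8))
approx = block_sparse_causal_attention(q, k, v, block=16, keep=2)
exact = dense_causal_attention(q, k, v)
```

With `keep` at least as large as the number of blocks, every causal block is selected and the result matches dense attention. Smaller values of `keep` trade accuracy for fewer score computations, which is the general trade-off that sparse pre-fill methods exploit.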
