
flashinfer-ai/flashinfer

FlashInfer is a high-performance GPU kernel library for LLM inference with unified APIs for attention, GEMM, and MoE operations.

5.8k stars · Python · Inference · Serving
Velocity (7d): +5.5 ★/day
Trend: steady

FlashInfer provides GPU kernels optimized for LLM inference workloads, covering both the prefill and decode phases. It offers unified APIs for attention, matrix multiplication, and mixture-of-experts operations, with multiple backend implementations such as FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM. The library supports low-precision compute with FP8 and FP4 quantization, and it integrates with CUDAGraph and torch.compile for production deployment.
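To make the prefill/decode distinction concrete, here is a minimal pure-Python reference sketch of the two attention phases. It does not use FlashInfer or any GPU code, and the names `attend`, `prefill`, and `decode_step` are hypothetical helpers invented for illustration, not part of the library's API. It only shows the arithmetic that such kernels accelerate: prefill runs causal attention over the whole prompt and fills a KV cache, while decode attends with a single new query against that cache.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector (illustrative helper)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def prefill(qs, ks, vs):
    """Causal attention over all prompt positions; also returns the KV cache."""
    outputs = [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]
    kv_cache = (list(ks), list(vs))
    return outputs, kv_cache

def decode_step(q_new, k_new, v_new, kv_cache):
    """Append the new token's K/V to the cache, then attend with one query."""
    keys, values = kv_cache
    keys.append(k_new)
    values.append(v_new)
    return attend(q_new, keys, values)

# Toy single-head inputs: 3 tokens, head dimension 2.
qs = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
ks = [[0.2, 0.1], [-0.3, 0.6], [0.7, 0.2]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Prefill the first two tokens, then decode the third against the cache.
_, cache = prefill(qs[:2], ks[:2], vs[:2])
decoded = decode_step(qs[2], ks[2], vs[2], cache)

# Running prefill over all three tokens gives the same last-position output.
full_outputs, _ = prefill(qs, ks, vs)
```

In this sketch the decode step reuses the cached keys and values instead of recomputing attention for earlier positions, which is why the two phases have different performance profiles and are typically served by separate kernels.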
