flashinfer-ai/flashinfer
FlashInfer is a high-performance GPU kernel library for LLM inference with unified APIs for attention, GEMM, and MoE operations.

FlashInfer provides GPU kernels optimized for LLM inference workloads, covering both the prefill and decode phases. It offers unified APIs for attention, matrix multiplication (GEMM), and mixture-of-experts (MoE) operations, with multiple backend implementations such as FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM. The library supports low-precision compute with FP8 and FP4 quantization and integrates with CUDA Graphs and torch.compile for production deployment.
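
The prefill and decode phases mentioned above differ in how many queries attend to the cached keys and values. The sketch below is a minimal pure-Python reference for single-head attention. It is not FlashInfer's API and does not reflect how its GPU kernels are implemented. The names `attend`, `prefill`, and `decode` are hypothetical and exist only for this illustration. It shows that a decode step for the newest token produces the same result as the last row of a causal prefill over the same tokens.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over a list of keys/values.
    # Hypothetical helper for illustration only.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def prefill(qs, ks, vs):
    # Prefill: every prompt token is processed at once, with a causal mask
    # so that token i only sees tokens 0..i.
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]

def decode(q_new, k_cache, v_cache):
    # Decode: a single new query attends to the whole KV cache.
    return attend(q_new, k_cache, v_cache)

# Toy single-head inputs: 3 tokens, head dimension 2.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

prefill_out = prefill(qs, ks, vs)
decode_out = decode(qs[-1], ks, vs)
```

Prefill does work proportional to the square of the prompt length in one batch, while each decode step does work proportional to the cache length for a single query. This difference in shape is one reason the two phases are commonly served by separately tuned kernels.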