thu-ml/SageAttention
A CUDA-based quantized attention library that accelerates transformer inference 2-5x via INT4/INT8/FP4/FP8 quantization without accuracy loss.

SageAttention provides optimized attention kernels for Ampere, Ada, and Hopper GPUs that leverage quantization to speed up inference across language, vision, and video models. It includes INT8 quantization for QK^T with smoothing, FP8 quantization for the PV computation, and a two-level accumulation strategy to maintain accuracy at FP8 precision. The library supports plug-and-play deployment across diverse transformer architectures.
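
To make the "INT8 quantization for QK^T with smoothing" idea concrete, here is a minimal pure-Python sketch. It is not the library's kernel or API. It assumes that smoothing means subtracting the per-channel mean of K over tokens (which shifts every row of QK^T by a constant and therefore leaves the softmax unchanged), and it uses simple symmetric per-tensor scales for brevity. All function names (`smooth_k`, `quantize_int8`, `int8_attention_probs`, and the helpers) are hypothetical and exist only for this illustration.

```python
import math
import random


def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def softmax_rows(scores):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out


def smooth_k(k):
    """Assumed smoothing step: subtract the per-channel mean of K over tokens."""
    n = len(k)
    mean = [sum(col) / n for col in zip(*k)]
    return [[x - m for x, m in zip(row, mean)] for row in k]


def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization. Returns (int matrix, scale)."""
    amax = max(abs(v) for row in x for v in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in x]
    return q, scale


def int8_attention_probs(q, k, smooth=True):
    """softmax(QK^T / sqrt(d)) with Q and K quantized to INT8 first."""
    if smooth:
        k = smooth_k(k)
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    s_int = matmul_t(q_i8, k_i8)  # integer products, integer accumulation
    factor = q_scale * k_scale / math.sqrt(len(q[0]))  # dequantize and scale
    return softmax_rows([[v * factor for v in row] for row in s_int])


def reference_probs(q, k):
    """Full-precision softmax(QK^T / sqrt(d)) for comparison."""
    d = len(q[0])
    return softmax_rows([[v / math.sqrt(d) for v in row] for row in matmul_t(q, k)])


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    random.seed(0)
    n, d = 8, 16
    q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    # K with a large shared offset per channel, which wastes INT8 range.
    k = [[50.0 + random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ref = reference_probs(q, k)
    print("error without smoothing:", max_abs_diff(ref, int8_attention_probs(q, k, smooth=False)))
    print("error with smoothing:   ", max_abs_diff(ref, int8_attention_probs(q, k, smooth=True)))
```

In this toy setup, K carries a large offset shared by all tokens, so an unsmoothed INT8 scale spends most of its range on the offset and leaves few levels for the token-to-token variation that actually determines the attention weights. Removing the mean first lets the same 8 bits cover only that variation. Real kernels would typically use finer-grained scales (for example per block) and fused GPU code; this sketch only illustrates the numerical idea.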