NVIDIA/cutlass
CUDA C++ template library and Python DSLs for high-performance GEMM operations used as foundational building blocks in deep learning frameworks.

CUTLASS provides hierarchical decomposition and data movement abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations in CUDA. Its modular components are reusable at every scale of the computation, and it supports mixed-precision computation with FP64, FP32, TF32, FP16, BF16, and narrow integer types on NVIDIA architectures from Volta through Blackwell. Version 4 adds Python-native interfaces for writing CUDA kernels without compromising performance, with faster compile times than pure C++ development.
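
To make "hierarchical decomposition" concrete, here is a minimal sketch of a tiled GEMM in plain Python. It is not the CUTLASS API and does not run on a GPU; the function names `tiled_gemm` and `reference_gemm` and the tile sizes are hypothetical, chosen only for illustration. It shows the general idea of splitting the output into tiles, walking the K dimension one tile at a time, and accumulating within each tile.

```python
# Illustrative only: plain Python, not the CUTLASS API.
# Function names and tile sizes are hypothetical.

def tiled_gemm(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Compute C = A @ B by iterating over output tiles, then over K tiles."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    assert len(B) == K, "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]

    # Outer level: each (m0, n0) output tile is an independent unit of work,
    # analogous to the work assigned to one group of GPU threads.
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            # Middle level: march along K one tile at a time, so only a
            # small slice of A and B is needed at any moment.
            for k0 in range(0, K, tile_k):
                # Inner level: multiply-accumulate inside the current tile.
                for i in range(m0, min(m0 + tile_m, M)):
                    for j in range(n0, min(n0 + tile_n, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def reference_gemm(A, B):
    """Untiled reference used to check the tiled version."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
C = tiled_gemm(A, B)
print(C)  # [[58.0, 64.0], [139.0, 154.0]]
```

The tile sizes do not change the result, only the order in which partial products are accumulated. In a real GPU kernel that ordering is what determines how much data can be reused from fast on-chip memory.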