NVIDIA/cutlass

CUDA C++ template library and Python DSLs for high-performance GEMM operations used as foundational building blocks in deep learning frameworks.

Velocity (7d): +3.2 ★/day · Trend: steady

CUTLASS provides hierarchical decomposition and data movement abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations in CUDA. Its reusable, modular components apply at every scale of that decomposition and support mixed-precision computation in FP64, FP32, TF32, FP16, BF16, and narrow integer types on NVIDIA architectures from Volta through Blackwell. Version 4 adds native Python interfaces for writing CUDA kernels without compromising performance, with faster compile times than pure C++ development.
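
To make "hierarchical decomposition" concrete, here is a minimal sketch in plain Python. It is not CUTLASS code and uses none of its APIs. The function names, the tile sizes, and the two-level split (an outer "block" tile divided into smaller "inner" tiles) are assumptions chosen only for illustration. On a GPU, each level would map to a different unit of parallelism and memory; here they are just nested loops.

```python
def naive_gemm(A, B):
    """Reference C = A @ B on lists of lists."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def tiled_gemm(A, B, block=(4, 4, 2), inner=(2, 2)):
    """Two-level tiled GEMM (illustrative only, hypothetical tile sizes).

    block = (bm, bn, bk): outer tile of C (bm x bn) and the K-slice width bk.
    inner = (wm, wn): smaller tile carved out of each outer tile.
    """
    M, K, N = len(A), len(B), len(B[0])
    bm, bn, bk = block
    wm, wn = inner
    C = [[0.0] * N for _ in range(M)]

    for m0 in range(0, M, bm):              # outer tile rows
        for n0 in range(0, N, bn):          # outer tile columns
            for k0 in range(0, K, bk):      # march along K in slices
                k_end = min(k0 + bk, K)
                m_end = min(m0 + bm, M)
                n_end = min(n0 + bn, N)
                for m1 in range(m0, m_end, wm):          # inner tile rows
                    for n1 in range(n0, n_end, wn):      # inner tile columns
                        for i in range(m1, min(m1 + wm, m_end)):
                            for j in range(n1, min(n1 + wn, n_end)):
                                acc = C[i][j]            # running accumulator
                                for k in range(k0, k_end):
                                    acc += A[i][k] * B[k][j]
                                C[i][j] = acc
    return C
```

The tiled version computes the same result as the naive one. The loop structure only changes the order in which partial products are accumulated, which is what lets a real kernel keep each tile's operands in fast memory while it works on them.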
