BBuf/how-to-optim-algorithm-in-cuda
A study notebook containing CUDA kernels, Triton code, CUTLASS notes, and LLM inference/training optimization material.

This repository is a public engineering notebook for GPU systems work, focused on optimizing AI/ML algorithms in CUDA. It includes handwritten CUDA kernels for common operations (reduce, softmax, GEMV, linear attention), CUTLASS and CuTe DSL notes covering GEMM, TMA, and WGMMA, Triton kernels with PyTorch interop examples, and extensive notes on LLM serving and training optimization. The material is organized into directories for CUDA kernels, CUDA-MODE lectures, CUTLASS, Triton, large-language-model systems, and PyTorch internals.