xlite-dev/LeetCUDA
Educational repository teaching modern CUDA programming with PyTorch, featuring 200+ kernels and implementation examples of flash attention and HGEMM.

Velocity (7d): +8.8 ★/day · Trend: steady
LeetCUDA is a learning-focused CUDA tutorial aimed at beginners. It provides annotated implementations of GPU kernels, including half-precision matrix multiplication (HGEMM) and flash attention written for tensor cores in pure MMA PTX, along with a range of common CUDA programming patterns. The material is organized around PyTorch integration and covers the TF32, F16, BF16, and F8 precision formats used in deep learning workloads.
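To make the precision formats mentioned above concrete, here is a minimal sketch in plain Python (standard library only) that emulates how a float32 value is narrowed to F16, BF16, and TF32. It is not taken from the repository. The helper names (`to_f16`, `to_bf16`, `to_tf32`) are hypothetical, and the BF16 and TF32 paths truncate the mantissa for simplicity, whereas real hardware conversions typically round. F8 is left out because it has several competing encodings.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_f16(x: float) -> float:
    """Round x to IEEE half precision: 5 exponent bits, 10 mantissa bits."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 mantissa bits).

    Assumption: keeps the top 16 bits of the float32 pattern (truncation).
    """
    return bits_f32(f32_bits(x) & 0xFFFF0000)


def to_tf32(x: float) -> float:
    """Emulate TF32 (8 exponent bits, 10 mantissa bits).

    Assumption: zeroes the low 13 mantissa bits of float32 (truncation).
    """
    return bits_f32(f32_bits(x) & 0xFFFFE000)


if __name__ == "__main__":
    x = 1.0 + 2.0 ** -10  # needs exactly 10 mantissa bits
    print("f16 :", to_f16(x))   # kept: F16 has 10 mantissa bits
    print("bf16:", to_bf16(x))  # lost: BF16 has only 7 mantissa bits
    print("tf32:", to_tf32(x))  # kept: TF32 has 10 mantissa bits
    # BF16 trades precision for float32-like range: 70000 exceeds the
    # largest finite F16 value (65504) but fits in BF16, coarsely.
    print("bf16(70000):", to_bf16(70000.0))
```

The sketch shows the trade-off these formats make: F16 and TF32 share a 10-bit mantissa, while BF16 and TF32 keep float32's 8-bit exponent and therefore its range.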