Is how-to-optim-algorithm-in-cuda open source?

Yes — BBuf/how-to-optim-algorithm-in-cuda is an open-source project tracked on heatdrop.

What language is how-to-optim-algorithm-in-cuda written in?

BBuf/how-to-optim-algorithm-in-cuda is primarily written in CUDA.

How popular is how-to-optim-algorithm-in-cuda?

BBuf/how-to-optim-algorithm-in-cuda has 3.1k stars on GitHub.

Where can I find how-to-optim-algorithm-in-cuda?

BBuf/how-to-optim-algorithm-in-cuda is on GitHub at https://github.com/BBuf/how-to-optim-algorithm-in-cuda.

BBuf/how-to-optim-algorithm-in-cuda

A public engineering notebook for GPU kernel archaeology

One developer's curated collection of handwritten CUDA kernels, CUTLASS notes, and LLM systems optimization material.

★3.1k stars Cuda Inference · Serving ML Frameworks

What it does

This repository is a living study journal for GPU systems programming. It gathers handwritten CUDA kernels (reduce, softmax, GEMV, linear attention), CUTLASS/CuTe DSL walkthroughs, Triton examples, PTX ISA notes, and PyTorch internals observations. The author also tracks LLM inference and training optimization, plus paper notes on GPU architecture and ML systems.

The interesting bit

Many CUDA tutorials stop at the basics. This one keeps going into WGMMA, TMA, swizzling, and instruction-level material: the details you need when PyTorch's defaults aren't fast enough and you have to write the kernel yourself.

Key highlights

Hand-rolled kernels for common primitives (reduce, softmax, atomicAdd, upsampling) alongside trickier ones like linear attention
CUTLASS/CuTe notes covering GEMM, the Tensor Memory Accelerator (TMA), and warp-group matrix multiply-accumulate (WGMMA)
Triton kernels with PyTorch interop examples
PTX ISA study notes for when you need to read the assembly
Active consolidation of older Chinese-language material into English entry points
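
To give a sense of what a hand-rolled kernel for a common primitive looks like, here is a minimal sketch of a sum reduction using warp shuffles. It is an illustration written for this page, not code taken from the repository: the names reduce_sum_kernel, warp_reduce_sum, and sum_on_gpu are hypothetical, the block size is assumed to be a multiple of 32, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not from the repository): sums one value per lane
// across a full 32-thread warp using register shuffles. Lane 0 ends up
// holding the warp total.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;
}

// Hypothetical kernel: each block sums a grid-strided slice of `in`,
// then atomically adds its partial sum into *out.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void reduce_sum_kernel(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];  // one slot per warp in the block

    // Grid-stride loop: every thread accumulates a private partial sum.
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        v += in[i];
    }

    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;

    // Stage 1: reduce within each warp, then publish one value per warp.
    v = warp_reduce_sum(v);
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    // Stage 2: the first warp reduces the per-warp partial sums.
    if (warp == 0) {
        const int num_warps = blockDim.x / 32;
        v = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);
    }
}

// Hypothetical host wrapper: copies the data to the GPU, launches the
// kernel, and returns the sum. CUDA error checking is left out on purpose.
float sum_on_gpu(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // all-zero bits are 0.0f

    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (blocks < 1) blocks = 1;
    if (blocks > 1024) blocks = 1024;  // the grid-stride loop covers the rest

    reduce_sum_kernel<<<blocks, threads>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling sum_on_gpu on a std::vector<float> returns the total as a float. Because the per-block partial sums are combined with atomicAdd, the floating-point summation order can vary between runs; tuned kernels often use a second reduction pass instead, along with vectorized loads and other refinements.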

Caveats

Some older notes remain in Chinese and are being gradually replaced
The repository includes a deprecated/ folder, so not every note is equally current

Verdict

Worth bookmarking if you're implementing custom GPU kernels for LLM serving or training, or trying to understand why CUTLASS does what it does. Less useful if you just need PyTorch to run faster out of the box; this is for the "write your own kernel" crowd.
