IST-DASLab/marlin
A Mixed Auto-Regressive Linear kernel for FP16xINT4 quantized LLM inference, with speedups of up to 4x on modern GPUs.

Marlin is a highly optimized kernel for LLM inference that performs mixed-precision matrix multiplication with FP16 activations and INT4 quantized weights. Because INT4 weights occupy a quarter of the memory of FP16 weights, the ideal speedup for memory-bound inference is 4x. Marlin stays close to that ideal up to batch sizes of 16-32 tokens, significantly outperforming prior quantized kernels, whose speedups are limited to batches of 1-2 tokens. The implementation organizes computation to fully utilize GPU resources, including L2 cache, shared memory, tensor cores, and vector cores, through techniques such as double buffering, asynchronous memory loads, and L2 cache-aware data placement.
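
To make the FP16xINT4 operation concrete, here is a minimal pure-Python sketch of what such a kernel computes: weights are stored as 4-bit codes, unpacked and rescaled on the fly, then multiplied with the activations. It is a reference for the arithmetic only. The packing order (eight codes per 32-bit word, lowest nibble first), the symmetric per-column scale with a zero point of 8, and all function names are assumptions made for illustration; they are not Marlin's actual weight layout or API.

```python
def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) into 32-bit words, eight per word.

    Assumed order: lowest nibble first. Hypothetical, not Marlin's layout.
    """
    assert len(values) % 8 == 0
    packed = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v <= 15
            word |= v << (4 * j)
        packed.append(word)
    return packed


def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit codes from 32-bit words."""
    return [(word >> (4 * j)) & 0xF for word in packed for j in range(8)]


def quantize_column(col):
    """Symmetric 4-bit quantization of one weight column: w ~ (q - 8) * scale."""
    max_abs = max(abs(w) for w in col) or 1.0
    scale = max_abs / 7.0
    codes = [min(15, max(0, round(w / scale) + 8)) for w in col]
    return codes, scale


def dequant_matmul(activations, packed_cols, scales, k):
    """Compute activations @ W, with W held as packed INT4 columns plus scales."""
    out = []
    for row in activations:
        out_row = []
        for packed, scale in zip(packed_cols, scales):
            codes = unpack_int4(packed)[:k]
            # Dequantize each code on the fly, then accumulate the dot product.
            out_row.append(sum(x * (q - 8) * scale for x, q in zip(row, codes)))
        out.append(out_row)
    return out


def weight_bytes(k, n, bits):
    """Storage for a k x n weight matrix at the given bit width (scales ignored)."""
    return k * n * bits // 8


if __name__ == "__main__":
    column = [7.0, -7.0, 0.0, 3.0, -2.0, 1.0, 5.0, -4.0]
    codes, scale = quantize_column(column)
    x = [[1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
    print(dequant_matmul(x, [pack_int4(codes)], [scale], k=8))
    print(weight_bytes(4096, 4096, 16) / weight_bytes(4096, 4096, 4))
```

The `weight_bytes` helper shows where the 4x figure comes from: at small batch sizes the matmul is dominated by reading the weights, and 4-bit weights need a quarter of the memory traffic of 16-bit weights (ignoring the small overhead of the scales). A real kernel performs the unpacking and rescaling in registers while tensor cores do the multiply-accumulate.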