Build a vLLM-grade inference engine from scratch in CUDA
A course and reference implementation that teaches you to build a high-performance LLM inference server in C++ and CUDA, from Safetensors to PagedAttention kernels.

What it does
tiny-vllm is a fully functional LLM inference server written in C++ and CUDA that doubles as a course. It loads a real model, Llama 3.2 1B Instruct in Safetensors format, and runs the complete forward pass, including prefill and decode, using hand-rolled CUDA kernels. The author walks you through the math and engineering from tokenization up to continuous batching and PagedAttention.
The interesting bit
Instead of treating inference as a black box, the project derives ideas like online softmax and PagedAttention from scratch and explains why the boring parts (memory layout, buffer reuse, column-major transposition tricks) matter for throughput. It also runs on AMD GPUs via a thin HIP compatibility layer, reusing the CUDA sources almost verbatim.
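To make the online softmax idea concrete, here is a minimal CPU-side sketch in plain C++. It is not taken from tiny-vllm. The function names are hypothetical, each key carries a single scalar value instead of a vector, and all scores are assumed to be finite. It shows how a running maximum and a rescaling factor let the softmax-weighted sum be computed in one pass, with a conventional two-pass version alongside for comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: one-pass ("online") softmax-weighted sum for one query.
// scores[i] is the attention logit for key i, values[i] a scalar value.
// Assumes every score is finite.
float online_softmax_weighted_sum(const std::vector<float>& scores,
                                  const std::vector<float>& values) {
    assert(!scores.empty() && scores.size() == values.size());
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;  // sum of exp(score - running_max) seen so far
    float acc = 0.0f;          // sum of exp(score - running_max) * value so far
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float new_max = std::max(running_max, scores[i]);
        // Rescale the partial results from the old maximum to the new one.
        const float correction = std::exp(running_max - new_max);
        const float p = std::exp(scores[i] - new_max);
        running_sum = running_sum * correction + p;
        acc = acc * correction + p * values[i];
        running_max = new_max;
    }
    return acc / running_sum;
}

// Reference: the usual two-pass softmax (find the max first, then accumulate).
float two_pass_softmax_weighted_sum(const std::vector<float>& scores,
                                    const std::vector<float>& values) {
    assert(!scores.empty() && scores.size() == values.size());
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    float acc = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float p = std::exp(scores[i] - max_score);
        sum += p;
        acc += p * values[i];
    }
    return acc / sum;
}
```

Because the loop never needs the full row of scores at once, the same recurrence can be applied tile by tile, which is what makes it a natural fit for a fused attention kernel.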
Key highlights
- Implements the full inference pipeline for Llama 3.2 1B Instruct, from Safetensors loading to token generation
- Covers advanced serving mechanics: static batching, continuous batching, KV cache, and PagedAttention with a custom CUDA kernel
- Derives FlashAttention-like online softmax and other kernels from first principles rather than calling a library
- Ships with a course-style walkthrough of the architecture, math, and GPU memory management
- Supports AMD GPUs through ROCm/HIP using a single compatibility header (src/cuda_to_hip.h) with no source forks
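The PagedAttention highlight above rests on a simple bookkeeping idea: the KV cache is split into fixed-size blocks, and each sequence keeps a block table that maps logical token positions to physical blocks. The sketch below illustrates that mapping on the CPU. It is an assumption-laden illustration, not tiny-vllm's actual data structure: the class names, the block size of 16 tokens, and the free-list allocator are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical names throughout; the block size is an assumed value.
constexpr std::size_t kBlockSize = 16;  // tokens per physical KV block

struct PhysicalSlot {
    std::size_t block;   // index of the physical block in the shared pool
    std::size_t offset;  // token slot inside that block
};

// Hands out physical block indices from a shared pool.
class BlockAllocator {
public:
    explicit BlockAllocator(std::size_t num_blocks) {
        // Store indices in reverse so that blocks are handed out as 0, 1, 2, ...
        for (std::size_t b = num_blocks; b > 0; --b) free_.push_back(b - 1);
    }
    std::size_t allocate() {
        if (free_.empty()) throw std::runtime_error("KV cache pool exhausted");
        const std::size_t b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(std::size_t b) { free_.push_back(b); }
    std::size_t free_count() const { return free_.size(); }

private:
    std::vector<std::size_t> free_;
};

// One request's view of the cache: a block table plus its token count.
class Sequence {
public:
    // Reserve the slot for the next token; a new block is taken only when
    // the previous one is full.
    PhysicalSlot append_token(BlockAllocator& pool) {
        if (length_ % kBlockSize == 0) block_table_.push_back(pool.allocate());
        ++length_;
        return locate(length_ - 1);
    }
    // Translate a logical token position into a physical block and offset.
    PhysicalSlot locate(std::size_t pos) const {
        assert(pos < length_);
        return {block_table_[pos / kBlockSize], pos % kBlockSize};
    }
    // Return every block to the pool, e.g. when the request finishes.
    void free_all(BlockAllocator& pool) {
        for (std::size_t b : block_table_) pool.release(b);
        block_table_.clear();
        length_ = 0;
    }

private:
    std::vector<std::size_t> block_table_;
    std::size_t length_ = 0;
};
```

Since blocks are allocated on demand, two interleaved sequences end up with non-contiguous physical blocks, and a finished sequence returns its blocks to the pool for reuse. An attention kernel would then read keys and values through this table rather than from one contiguous buffer per sequence.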
Verdict
Grab this if you are a systems engineer or ambitious learner who wants to understand how high-throughput LLM serving actually works under the hood. Skip it if you just need a production inference endpoint and have no patience for manual CUDA kernel tuning.
Frequently asked
- What is jmaczan/tiny-vllm?
- A course and reference implementation that teaches you to build a high-performance LLM inference server in C++ and CUDA, from Safetensors to PagedAttention kernels.
- Is tiny-vllm open source?
- Yes — jmaczan/tiny-vllm is open source, released under the Apache-2.0 license.
- What language is tiny-vllm written in?
- jmaczan/tiny-vllm is primarily written in C++.
- How popular is tiny-vllm?
- jmaczan/tiny-vllm has 824 stars on GitHub.
- Where can I find tiny-vllm?
- jmaczan/tiny-vllm is on GitHub at https://github.com/jmaczan/tiny-vllm.