jmaczan/tiny-vllm

Build a vLLM-grade inference engine from scratch in CUDA

A course and reference implementation that teaches you to build a high-performance LLM inference server in C++ and CUDA, from Safetensors to PagedAttention kernels.


What it does

tiny-vllm is a fully functional LLM inference server written in C++ and CUDA that doubles as a course. It loads a real model (Llama 3.2 1B Instruct, stored as Safetensors) and runs the complete forward pass, covering both prefill and decode, with hand-written CUDA kernels. The author walks you through the math and engineering, from tokenization up to continuous batching and PagedAttention.
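
To make the Safetensors step concrete, here is a minimal C++ sketch of how such a file can be split into its parts. It relies only on the published file layout: an unsigned 64-bit little-endian length N, then N bytes of JSON metadata, then the raw tensor bytes. The names `SafetensorsLayout` and `split_safetensors` are hypothetical and are not taken from the tiny-vllm sources, and parsing the JSON itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not the loader used by tiny-vllm.
struct SafetensorsLayout {
    std::uint64_t header_len;  // N: size of the JSON header in bytes
    std::string header_json;   // tensor names, dtypes, shapes, data offsets
    std::size_t data_start;    // file offset of the first tensor byte
};

SafetensorsLayout split_safetensors(const std::vector<std::uint8_t>& file) {
    assert(file.size() >= 8 && "file too small to hold the length prefix");

    // Decode the 8-byte little-endian length prefix byte by byte, so the
    // result does not depend on the host's endianness.
    std::uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<std::uint64_t>(file[i]) << (8 * i);
    }
    assert(n <= file.size() - 8 && "header length exceeds file size");

    SafetensorsLayout out;
    out.header_len = n;
    out.header_json.assign(file.begin() + 8,
                           file.begin() + 8 + static_cast<std::ptrdiff_t>(n));
    out.data_start = 8 + static_cast<std::size_t>(n);
    return out;
}
```

A real loader would then parse `header_json` to find each tensor's dtype, shape, and byte range, and copy the bytes starting at `data_start` into GPU memory.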

The interesting bit

Instead of treating inference as a black box, the project derives ideas like online softmax and PagedAttention from scratch, and explains why the unglamorous parts (memory layout, buffer reuse, column-major transposition tricks) matter for throughput. It also runs on AMD GPUs via a thin HIP compatibility layer that reuses the CUDA sources almost verbatim.
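
As a rough illustration of the online softmax idea, the sketch below is a CPU reference in plain C++, not the project's CUDA kernel. It keeps a running maximum and a running sum in a single pass over the input, rescaling the sum whenever the maximum grows. The function name `online_softmax` is an assumption, and inputs are assumed to be finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// CPU reference sketch of online softmax (illustrative, assumes finite inputs).
std::vector<float> online_softmax(const std::vector<float>& x) {
    assert(!x.empty());

    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // Single pass: update the max and the normalizer together.
    // When the max increases, the old sum is rescaled by exp(old_max - new_max)
    // so that every term stays expressed relative to the current max.
    for (float v : x) {
        const float new_max = std::max(running_max, v);
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(v - new_max);
        running_max = new_max;
    }

    // Second pass: normalize each element with the final max and sum.
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - running_max) / running_sum;
    }
    return out;
}
```

The naive version needs one pass for the maximum and another for the sum before it can normalize. Fusing those two passes is what lets attention kernels process keys block by block without materializing the full score row.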

Key highlights

  • Implements the full inference pipeline for Llama 3.2 1B Instruct, from Safetensors loading to token generation
  • Covers advanced serving mechanics: static batching, continuous batching, KV cache, and PagedAttention with a custom CUDA kernel
  • Derives FlashAttention-like online softmax and other kernels from first principles rather than calling a library
  • Ships with a course-style walkthrough of the architecture, math, and GPU memory management
  • Supports AMD GPUs through ROCm/HIP using a single compatibility header (src/cuda_to_hip.h) with no source forks
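
The bookkeeping behind a paged KV cache can be sketched on the host side in a few lines of C++. This is a simplified illustration under stated assumptions, not the project's code: the names `BlockAllocator`, `Sequence`, `append_token`, `locate`, and `free_sequence` are hypothetical, the block size is fixed, and there is no GPU memory involved. Each sequence holds a block table that maps logical token positions to physical cache blocks, and blocks are handed out from a shared free list only when needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; a host-side sketch of paged KV bookkeeping.
struct PhysicalSlot {
    int block;   // index of the physical cache block
    int offset;  // token slot inside that block
};

class BlockAllocator {
public:
    explicit BlockAllocator(int num_blocks) {
        // Push in reverse so that block 0 is handed out first.
        for (int b = num_blocks - 1; b >= 0; --b) free_.push_back(b);
    }
    int allocate() {
        assert(!free_.empty() && "out of KV cache blocks");
        const int b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(int b) { free_.push_back(b); }
    std::size_t free_count() const { return free_.size(); }

private:
    std::vector<int> free_;
};

struct Sequence {
    std::vector<int> block_table;  // logical block index -> physical block
    int num_tokens = 0;
};

// Reserve a cache slot for one more token; a new block is taken from the
// allocator only when the sequence's last block is full (or it has none yet).
void append_token(Sequence& seq, BlockAllocator& alloc, int block_size) {
    if (seq.num_tokens % block_size == 0) {
        seq.block_table.push_back(alloc.allocate());
    }
    ++seq.num_tokens;
}

// Translate a logical token position into a physical block and offset.
PhysicalSlot locate(const Sequence& seq, int token_pos, int block_size) {
    assert(token_pos >= 0 && token_pos < seq.num_tokens);
    return {seq.block_table[token_pos / block_size], token_pos % block_size};
}

// Return all of a finished sequence's blocks to the free list.
void free_sequence(Sequence& seq, BlockAllocator& alloc) {
    for (int b : seq.block_table) alloc.release(b);
    seq.block_table.clear();
    seq.num_tokens = 0;
}
```

Because blocks are allocated on demand and returned when a sequence finishes, sequences of different lengths can share one cache pool without reserving their maximum length up front. That property is what makes this layout a natural fit for continuous batching.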

Verdict

Grab this if you are a systems engineer or ambitious learner who wants to understand how high-throughput LLM serving actually works under the hood. Skip it if you just need a production inference endpoint and have no patience for manual CUDA kernel tuning.

Frequently asked

What is jmaczan/tiny-vllm?
A course and reference implementation that teaches you to build a high-performance LLM inference server in C++ and CUDA, from Safetensors to PagedAttention kernels.
Is tiny-vllm open source?
Yes. jmaczan/tiny-vllm is open source, released under the Apache-2.0 license.
What language is tiny-vllm written in?
jmaczan/tiny-vllm is primarily written in C++.
How popular is tiny-vllm?
jmaczan/tiny-vllm has 824 stars on GitHub.
Where can I find tiny-vllm?
jmaczan/tiny-vllm is on GitHub at https://github.com/jmaczan/tiny-vllm.
