Is tiny-vllm open source?

Yes — jmaczan/tiny-vllm is open source, released under the Apache-2.0 license.

What language is tiny-vllm written in?

jmaczan/tiny-vllm is primarily written in C++.

How popular is tiny-vllm?

jmaczan/tiny-vllm has 824 stars on GitHub.

Where can I find tiny-vllm?

jmaczan/tiny-vllm is on GitHub at https://github.com/jmaczan/tiny-vllm.

jmaczan/tiny-vllm

Build a vLLM-grade inference engine from scratch in CUDA

A course and reference implementation that teaches you to build a high-performance LLM inference server in C++ and CUDA, from Safetensors to PagedAttention kernels.

★ 824 stars · C++ · Inference · Serving · Language Models

What it does

tiny-vllm is a fully functional LLM inference server written in C++ and CUDA that doubles as a course. It loads a real model (Llama 3.2 1B Instruct, from Safetensors) and runs the complete forward pass, including prefill and decode, using hand-rolled CUDA kernels. The author walks you through the math and engineering from tokenization up to continuous batching and PagedAttention.
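To make the Safetensors loading step concrete, here is a minimal C++ sketch of the first thing a loader has to do: split the file into its JSON header and its raw tensor bytes. It relies on the published Safetensors layout (an 8-byte little-endian unsigned length, then that many bytes of JSON, then tensor data). The names `SafetensorsHeader` and `read_safetensors_header` are hypothetical and are not taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Result of splitting a Safetensors buffer into its JSON header and the
// offset where the raw tensor bytes begin.
struct SafetensorsHeader {
    std::string json;            // UTF-8 JSON with dtype, shape, data_offsets
    std::size_t data_start = 0;  // byte offset of the tensor data region
};

// Hypothetical helper (not from the repository): reads the 8-byte
// little-endian header length, then slices out the JSON header.
// Returns false if the buffer is too short for the declared header.
bool read_safetensors_header(const std::vector<std::uint8_t>& file,
                             SafetensorsHeader& out) {
    const std::size_t kPrefix = 8;
    if (file.size() < kPrefix) return false;

    // Assemble the length byte by byte so the result does not depend on
    // the endianness of the host.
    std::uint64_t header_len = 0;
    for (std::size_t i = 0; i < kPrefix; ++i) {
        header_len |= static_cast<std::uint64_t>(file[i]) << (8 * i);
    }

    // Reject a declared length that runs past the end of the buffer.
    if (header_len > file.size() - kPrefix) return false;

    const std::size_t len = static_cast<std::size_t>(header_len);
    out.json.assign(file.begin() + kPrefix, file.begin() + kPrefix + len);
    out.data_start = kPrefix + len;
    assert(out.data_start <= file.size());  // data region starts inside the buffer
    return true;
}
```

A real loader would go on to parse the JSON and copy each tensor's byte range to the GPU. That part is omitted here because the source does not describe how the project does it.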

The interesting bit

Instead of treating inference as a black box, the project derives ideas like online softmax and PagedAttention from scratch, explaining why the boring parts (memory layout, buffer reuse, column-major transposition tricks) matter for throughput. It even runs on AMD GPUs via a thin HIP compatibility layer, reusing the CUDA sources almost verbatim.
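As an illustration of the online softmax idea, the sketch below is plain host-side C++ that keeps a running maximum and a running sum, rescaling the sum whenever a larger score appears. This is the standard formulation of the technique, not the project's kernel. The names `SoftmaxState`, `online_softmax_update` and `online_softmax` are hypothetical, and the code assumes every score is finite (no `-inf` masking).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running statistics for online softmax: the largest score seen so far and
// the sum of exp(x_i - max) expressed relative to that maximum.
struct SoftmaxState {
    float max = -std::numeric_limits<float>::infinity();
    float sum = 0.0f;
};

// Folds one score into the state. When a new maximum appears, the old sum
// is rescaled by exp(old_max - new_max) so it stays consistent.
// Assumption: x is finite; masked (-inf) scores would need special handling.
inline void online_softmax_update(SoftmaxState& s, float x) {
    assert(std::isfinite(x));
    const float new_max = std::fmax(s.max, x);
    s.sum = s.sum * std::exp(s.max - new_max) + std::exp(x - new_max);
    s.max = new_max;
}

// Computes softmax with a single pass for the max and the normalizer,
// followed by one pass that writes the probabilities.
std::vector<float> online_softmax(const std::vector<float>& scores) {
    SoftmaxState s;
    for (float x : scores) online_softmax_update(s, x);

    std::vector<float> probs(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - s.max) / s.sum;
    }
    return probs;
}
```

In a fused attention kernel the same rescale factor is typically applied to a running output accumulator as well, which is what lets the scores be consumed tile by tile without storing the full row. Whether the project structures its kernel exactly this way is not stated in the source.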

Key highlights

Implements the full inference pipeline for Llama 3.2 1B Instruct, from Safetensors loading to token generation
Covers advanced serving mechanics: static batching, continuous batching, KV cache, and PagedAttention with a custom CUDA kernel
Derives FlashAttention-like online softmax and other kernels from first principles rather than calling a library
Ships with a course-style walkthrough of the architecture, math, and GPU memory management
Supports AMD GPUs through ROCm/HIP using a single compatibility header (src/cuda_to_hip.h) with no source forks
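The KV cache and PagedAttention highlights above can be pictured with a small bookkeeping sketch in host-side C++. It shows the general idea behind paged KV caches: each sequence owns a block table that maps logical block indices to physical block ids drawn from a shared pool. Every name here (`BlockAllocator`, `Sequence`, `Slot`, `append_token`, `free_sequence`) is hypothetical, and the block size is an arbitrary parameter. None of it is taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool of fixed-size physical KV-cache blocks.
class BlockAllocator {
public:
    explicit BlockAllocator(std::size_t num_blocks) {
        // Push ids in reverse so block 0 is handed out first.
        for (std::size_t i = num_blocks; i > 0; --i) {
            free_.push_back(static_cast<int>(i - 1));
        }
    }
    // Returns a free physical block id, or -1 when the pool is exhausted.
    int allocate() {
        if (free_.empty()) return -1;
        const int id = free_.back();
        free_.pop_back();
        return id;
    }
    void release(int id) { free_.push_back(id); }
    std::size_t free_count() const { return free_.size(); }

private:
    std::vector<int> free_;
};

// Per-sequence state: the block table plus the number of cached tokens.
struct Sequence {
    std::vector<int> block_table;  // logical block index -> physical block id
    std::size_t num_tokens = 0;
};

// Where the next token's keys and values would be written.
struct Slot {
    int block = -1;
    std::size_t offset = 0;
};

// Reserves the cache slot for the next token, allocating a new physical
// block only when the sequence has filled all blocks it already owns.
bool append_token(Sequence& seq, BlockAllocator& alloc,
                  std::size_t block_size, Slot& slot) {
    assert(block_size > 0);
    const std::size_t logical = seq.num_tokens / block_size;
    if (logical == seq.block_table.size()) {
        const int id = alloc.allocate();
        if (id < 0) return false;  // pool exhausted; caller must wait or evict
        seq.block_table.push_back(id);
    }
    slot.block = seq.block_table[logical];
    slot.offset = seq.num_tokens % block_size;
    ++seq.num_tokens;
    return true;
}

// Returns all of a finished sequence's blocks to the pool.
void free_sequence(Sequence& seq, BlockAllocator& alloc) {
    for (int id : seq.block_table) alloc.release(id);
    seq.block_table.clear();
    seq.num_tokens = 0;
}
```

Because blocks return to the pool as soon as a sequence finishes, a scheduler doing continuous batching can admit a new request without waiting for the rest of the batch. An attention kernel would then read keys and values through the block table instead of assuming one contiguous buffer per sequence.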

Verdict

Grab this if you are a systems engineer or ambitious learner who wants to understand how high-throughput LLM serving actually works under the hood. Skip it if you just need a production inference endpoint and have no patience for manual CUDA kernel tuning.
