skyzh/tiny-llm
A hands-on course teaching systems engineers to build a minimal vLLM-like LLM inference engine from scratch using Python and MLX.

Velocity (7d): +10 ★ / day · Trend: steady
This repository provides a week-long course on LLM serving infrastructure for systems engineers. It covers implementing core LLM components (attention, RoPE, QK norm) in pure Python without high-level neural network APIs, then builds a simplified vLLM-style inference system with KV caching, continuous batching, and flash attention optimizations. The course uses Qwen3 models running on Apple Silicon via the MLX framework.
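To make the KV caching idea concrete, here is a minimal sketch of single-head scaled dot-product attention with a key/value cache. It is not taken from the course: it uses only the Python standard library rather than MLX, and the names `softmax`, `attend`, and `KVCache` are hypothetical, chosen for illustration. During decoding, each new token appends its key and value to the cache and attends over everything stored so far, so earlier tokens are never recomputed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output component at a time.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCache:
    """Hypothetical single-head cache holding keys/values of past tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Store the new token's key/value, then attend over the whole cache.
        # Only past and current tokens are present, so causality is implicit.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


cache = KVCache()
first = cache.step([1.0, 0.0], [1.0, 0.0], [1.0, 2.0])
second = cache.step([0.0, 1.0], [0.0, 1.0], [3.0, 4.0])
print(first)
print(second)
```

With one cached token the output equals that token's value vector; with two, the output is a weighted mix that leans toward the value whose key best matches the query. A real engine would batch this over heads and sequences and store the cache in preallocated tensors, which this sketch deliberately omits.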