CalvinXKY/InfraTech

A Chinese notebook dojo for the messy guts of LLM inference

Jupyter notebooks that walk through vLLM scheduling, SGLang's RadixAttention, and why prefix caching is "zero overhead". Not a framework, more like a disassembly manual.

2.5k stars | Jupyter Notebook | Inference · Serving | ML Frameworks | Learning
Velocity (7d): +12 ★/day · Trend: steady

What it does

InfraTech is a curated collection of Jupyter notebooks and Zhihu articles covering the internals of LLM training and inference infrastructure. Topics span PyTorch, vLLM, SGLang, distributed strategies (DP/TP/PP/SP/EP), quantization, speculative decoding, and even CUDA graph quirks. Each notebook pairs with a Chinese-language explainer article. Think of it as a self-study syllabus for the stack between “I know transformers” and “I can debug a production serving framework.”

The interesting bit

The author doesn’t just explain: he “hand-crafts” (手搓) minimal implementations, like a basic vLLM scheduler or a from-scratch SGLang profiler, then layers in real-framework analysis. The “Nano-vLLM” notebook in particular builds a toy inference engine to map concepts before you drown in the actual vLLM codebase.
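To give a feel for what a "hand-crafted" scheduler of this kind might look like, here is a minimal sketch of iteration-level scheduling with a per-step token budget and chunked prefill. It is not taken from the repo's notebooks or from vLLM; `Request`, `ToyScheduler`, and `token_budget` are hypothetical names, and details such as sampling the first token at the end of prefill, preemption, and KV-block accounting are deliberately left out.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int        # prompt tokens that must be prefilled
    max_new_tokens: int    # decode steps to run after prefill
    prefilled: int = 0     # prompt tokens processed so far
    generated: int = 0     # output tokens produced so far

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


class ToyScheduler:
    """Hypothetical scheduler: decodes first, then spends the rest of the budget on prefill."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> dict[str, int]:
        """Plan one iteration and return {req_id: tokens scheduled this step}."""
        budget = self.token_budget
        plan: dict[str, int] = {}

        # 1. Requests that finished prefill each decode exactly one token.
        for req in self.running:
            if req.prefilled == req.prompt_len and budget > 0:
                plan[req.req_id] = 1
                budget -= 1

        # 2. Requests still in prefill get the next chunk of their prompt.
        for req in self.running:
            if req.prefilled < req.prompt_len and budget > 0:
                chunk = min(budget, req.prompt_len - req.prefilled)
                plan[req.req_id] = chunk
                budget -= chunk

        # 3. Admit waiting requests while budget remains (FIFO order).
        while self.waiting and budget > 0:
            req = self.waiting.popleft()
            chunk = min(budget, req.prompt_len)
            plan[req.req_id] = chunk
            budget -= chunk
            self.running.append(req)

        # Apply the plan: advance prefill or decode progress, then retire finished requests.
        for req in self.running:
            n = plan.get(req.req_id, 0)
            if n == 0:
                continue
            if req.prefilled < req.prompt_len:
                req.prefilled += n
            else:
                req.generated += n
        self.running = [r for r in self.running if not r.done]
        return plan
```

With a budget of 8 tokens, a 10-token prompt is prefilled over two steps (8 tokens, then 2), and the leftover budget in the second step is what lets a newly arrived request start its own prefill in the same iteration. Real schedulers add many constraints on top of this, so treat the sketch as a reading aid only.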

Key highlights

  • ~25 notebooks with difficulty ratings (⚡️ to ⚡️⚡️⚡️), so you can gauge depth before diving in
  • Heavy focus on inference optimization: chunked prefill, flash decoding, KV cache management, prefix caching
  • Framework-specific deep cuts: vLLM memory snapshots, SGLang RadixAttention, RL training-inference colocation with Megatron
  • Distributed systems basics: collective operations, tensor parallelism, Ulysses attention
  • Linked Zhihu articles provide narrative context; notebooks provide runnable code
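As a rough illustration of the prefix-caching idea mentioned in the highlights, the sketch below keys each fixed-size block of tokens by a hash chained to its parent block, so a lookup only hits when the entire prefix matches. This is one common approach written from scratch for illustration; it is not the repo's code, and SGLang's RadixAttention uses a radix tree rather than a flat hash table. `BLOCK_SIZE`, `block_hashes`, and `ToyPrefixCache` are hypothetical names.

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4  # tokens per KV block; an illustrative value


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chain-hash every full block so each key depends on the whole prefix before it."""
    hashes: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Hypothetical cache mapping block hashes to made-up physical block ids."""

    def __init__(self) -> None:
        self._blocks: dict[str, int] = {}
        self._next_id = 0

    def match(self, token_ids: list[int]) -> int:
        """Return how many leading tokens already have cached KV blocks."""
        hit = 0
        for h in block_hashes(token_ids):
            if h not in self._blocks:
                break
            hit += BLOCK_SIZE
        return hit

    def insert(self, token_ids: list[int]) -> None:
        """Register every full block of this sequence as cached."""
        for h in block_hashes(token_ids):
            if h not in self._blocks:
                self._blocks[h] = self._next_id
                self._next_id += 1
```

After inserting one sequence, a second request that shares its first eight tokens would only need prefill for the tokens past that point, while a request that differs in the very first block gets no reuse at all, even if later blocks happen to contain the same tokens.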

Caveats

  • Entirely Chinese-language; no English translations visible in the README
  • Some notebooks live in other repos (e.g., BasicCUDA for nondeterministic reduction) without clear cross-repo navigation
  • No automated tests or CI; it’s a knowledge repo, not a maintained library

Verdict

Worth bookmarking if you’re an engineer in the AI infra space who reads Chinese and learns by disassembling. Skip it if you need a drop-in tool or English-only documentation.
