Mandarin notes on the machinery behind LLM inference
A curated Chinese-language knowledge base that dissects transformers, quantization, and inference kernels so you don't have to read the papers alone.

What it does
This repository is a structured collection of Chinese-language study notes on the mechanical guts of large language model inference. It walks through transformer architectures (LLaMA, GPT, ViT), quantization methods (SmoothQuant, AWQ), and performance optimization techniques (FlashAttention v1/v2/v3, tensor parallelism, CUDA graphs). The author also uses the repository as a landing page for a paid course on building a lightweight Triton-based inference framework, though the open-source content itself is primarily documentation and curated reading lists.
The interesting bit
Most LLM repositories ship code; this one ships reading lists and paper dissections. It treats the boring parts—Roofline models, GPU memory hierarchy, online-softmax—as first-class citizens, which is exactly where the speedups actually live. There is something almost retro about a GitHub repository that is mostly well-organized markdown homework.
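To make one of those "boring parts" concrete, here is a minimal sketch of the online-softmax idea that FlashAttention-style kernels build on. It is plain Python written for this page, not code from the repository. The function name `online_softmax` is hypothetical, and finite inputs are assumed. The max and the normalizer are accumulated in a single sweep over the inputs, and the running sum is rescaled whenever a larger max appears.

```python
import math


def online_softmax(xs):
    """Softmax whose max and normalizer are computed in one sweep.

    Illustrative sketch only; assumes every value in xs is finite.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max,
        # then add the current element's contribution.
        running_sum = (
            running_sum * math.exp(running_max - new_max)
            + math.exp(x - new_max)
        )
        running_max = new_max
    # A second sweep normalizes each element with the final statistics.
    return [math.exp(x - running_max) / running_sum for x in xs]


if __name__ == "__main__":
    print(online_softmax([1.0, 2.0, 3.0]))
```

A naive numerically stable softmax needs one sweep to find the max and another to sum the exponentials. Merging the two is what lets a kernel process attention scores block by block without holding the full row in fast memory.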
Key highlights
- Extensive coverage of the FlashAttention evolution (v1 through v3) with dedicated paper breakdowns and a comparative summary.
- Practical GPU programming tracks: Triton kernel development basics and CUDA architecture notes, including memory organization and execution models.
- Framework autopsies: detailed walkthroughs of vLLM’s inference pipeline, TGI, and LightLLM.
- Quantization deep-dives: SmoothQuant and AWQ papers with accompanying source-code analysis.
- Curated external resources for CUDA/Triton learning, plus the author’s blunt reviews of which textbooks are worth reading and which are outdated.
Caveats
- The content is overwhelmingly in Chinese; English-only readers need not apply.
- The README prominently advertises a paid course (¥499), and the open-source notes function more as a syllabus and bibliography than a standalone, installable framework.
Verdict
Worth bookmarking if you read Chinese and are interviewing for HPC or LLM inference engineering roles; skip it if you are hunting for a pip-installable framework or English documentation.
Frequently asked
- What is harleyszhang/llm_note?
- A curated Chinese-language knowledge base that dissects transformers, quantization, and inference kernels so you don't have to read the papers alone.
- Is llm_note open source?
- Yes — harleyszhang/llm_note is an open-source project tracked on heatdrop.
- What language is llm_note written in?
- harleyszhang/llm_note is primarily written in Python.
- How popular is llm_note?
- harleyszhang/llm_note has 882 stars on GitHub.
- Where can I find llm_note?
- harleyszhang/llm_note is on GitHub at https://github.com/harleyszhang/llm_note.