harleyszhang/llm_note

Mandarin notes on the machinery behind LLM inference

A curated Chinese-language knowledge base that dissects transformers, quantization, and inference kernels so you don't have to read the papers alone.

What it does

This repository is a structured collection of Chinese-language study notes on the mechanical guts of large language model inference. It walks through transformer architectures (LLaMA, GPT, ViT), quantization methods (SmoothQuant, AWQ), and performance optimization techniques (FlashAttention v1/v2/v3, tensor parallelism, CUDA graphs). The author also uses the repository as a landing page for a paid course on building a lightweight Triton-based inference framework, though the open-source content itself is primarily documentation and curated reading lists.

The interesting bit

Most LLM repositories ship code; this one ships reading lists and paper dissections. It treats the boring parts (Roofline models, GPU memory hierarchy, online-softmax) as first-class citizens, and those are where the speedups live. There is something almost retro about a GitHub repository that is mostly well-organized markdown homework.
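
For readers who have not met the online-softmax trick named above, here is a minimal sketch in plain Python. It is an illustration written for this page, not code taken from the repository: the function name `online_softmax` is hypothetical, and inputs are assumed to be finite floats. The point is that the running maximum and the normalizer can be updated together in one pass, which is the idea FlashAttention-style kernels rely on.

```python
import math

def online_softmax(scores):
    """Softmax with the max and the normalizer computed in a single pass.

    Hypothetical helper for illustration; assumes finite float inputs.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max,
        # then add the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Second pass: normalize each score with the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in scores]

probs = online_softmax([1.0, 2.0, 3.0])
print(probs)
```

A naive softmax needs one pass to find the max and another to sum the exponentials before normalizing; folding the first two passes into one is what lets a kernel process attention scores block by block without storing the whole row.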

Key highlights

  • Extensive coverage of the FlashAttention evolution (v1 through v3) with dedicated paper breakdowns and a comparative summary.
  • Practical GPU programming tracks: Triton kernel development basics and CUDA architecture notes, including memory organization and execution models.
  • Framework autopsies: detailed walkthroughs of vLLM’s inference pipeline, TGI, and LightLLM.
  • Quantization deep-dives: SmoothQuant and AWQ papers with accompanying source-code analysis.
  • Curated external resources for CUDA/Triton learning, plus the author’s blunt reviews of which textbooks are worth reading and which are outdated.

Caveats

  • The content is overwhelmingly in Chinese; English-only readers need not apply.
  • The README prominently advertises a paid course (¥499), and the open-source notes function more as a syllabus and bibliography than a standalone, installable framework.

Verdict

Worth bookmarking if you read Chinese and are interviewing for HPC or LLM inference engineering roles; skip it if you are hunting for a pip-installable framework or English documentation.

Frequently asked

What is harleyszhang/llm_note?
A curated Chinese-language knowledge base that dissects transformers, quantization, and inference kernels so you don't have to read the papers alone.
Is llm_note open source?
Yes — harleyszhang/llm_note is an open-source project tracked on heatdrop.
What language is llm_note written in?
harleyszhang/llm_note is primarily written in Python.
How popular is llm_note?
harleyszhang/llm_note has 882 stars on GitHub.
Where can I find llm_note?
harleyszhang/llm_note is on GitHub at https://github.com/harleyszhang/llm_note.
