cfregly/ai-performance-engineering
Code and resources for an O'Reilly book covering GPU optimization, distributed training, and inference scaling for AI systems.

This repository contains code, tooling, and resources accompanying an O'Reilly book on AI systems performance engineering. The material covers GPU optimization, distributed training pipelines, and inference scaling techniques. It provides hands-on guidance for profiling AI workloads with the PyTorch profiler and NVIDIA Nsight tools, and it demonstrates high-throughput inference patterns using vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo, including paged KV caching and disaggregated prefill/decode architectures.
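
As a rough illustration of the paged KV cache idea mentioned above, the following is a minimal sketch, not code taken from this repository or from any of the listed inference engines. The class `PagedKVCache` and its methods are hypothetical names chosen for illustration. The sketch tracks only the block table bookkeeping (which physical block holds which logical token position) and leaves out the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator (hypothetical; no real K/V tensors are stored)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append(self, seq_id, n_tokens=1):
        """Reserve room for n_tokens more tokens, allocating blocks on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        blocks_required = -(-new_len // self.block_size)  # ceiling division
        needed = blocks_required - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV cache blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def locate(self, seq_id, pos):
        """Map a logical token position to (physical block id, offset in block)."""
        if not 0 <= pos < self.seq_lens[seq_id]:
            raise IndexError("token position out of range")
        table = self.block_tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
cache.append("a", 3)   # 3 tokens need 2 blocks of size 2
cache.append("b", 4)   # 4 tokens need the remaining 2 blocks
```

Because blocks are allocated only as a sequence grows, sequences of different lengths can share one fixed pool without each reserving memory for its maximum length up front. Calling `cache.free("b")` returns its blocks to the pool so another sequence can reuse them.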