sgl-project/mini-sglang
A compact ~5,000-line Python implementation of an LLM serving system with state-of-the-art inference optimizations.

Mini-SGLang is a reference implementation of SGLang’s LLM serving framework that provides a high-performance inference engine for large language models. Its optimizations include Radix Cache, which reuses KV cache across requests that share a prefix; Chunked Prefill, which reduces peak memory during long-context serving; Overlap Scheduling, which hides CPU overhead behind GPU computation; and Tensor Parallelism, which scales inference across multiple GPUs. The framework also integrates FlashAttention and FlashInfer kernels for efficient attention computation.
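
To make prefix reuse and chunked prefill more concrete, here is a minimal sketch in plain Python. It is not Mini-SGLang's code: the names `PrefixCache`, `match_prefix`, and `prefill_chunks` are hypothetical, the cache is a simple token-level trie rather than a true radix tree, and it tracks only token IDs, not actual KV tensors. It shows how a new request could skip tokens whose prefix was already seen and split the rest into fixed-size prefill chunks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class _Node:
    # Hypothetical trie node: one child per next token ID.
    children: dict[int, _Node] = field(default_factory=dict)


class PrefixCache:
    """Toy token-level trie (assumption: a real radix tree would
    compress single-child chains into one edge and store KV handles)."""

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, tokens: list[int]) -> None:
        # Record a processed sequence so later requests can reuse its prefix.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def match_prefix(self, tokens: list[int]) -> int:
        # Return how many leading tokens are already present in the cache.
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


def prefill_chunks(tokens: list[int], cached: int, chunk_size: int) -> list[list[int]]:
    # Split the uncached suffix into chunks of at most chunk_size tokens,
    # so each prefill step handles a bounded number of tokens.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    remaining = tokens[cached:]
    return [remaining[i:i + chunk_size] for i in range(0, len(remaining), chunk_size)]


cache = PrefixCache()
cache.insert([1, 2, 3, 4])            # first request, e.g. a shared system prompt
request = [1, 2, 3, 9, 8, 7, 6, 5]    # second request sharing the first 3 tokens
hit = cache.match_prefix(request)
chunks = prefill_chunks(request, hit, chunk_size=2)
```

In this sketch the second request reuses the three-token shared prefix, and only the five remaining tokens are prefilled, two at a time. A production system would additionally handle eviction, reference counting, and the mapping from cached prefixes to GPU memory, none of which is modeled here.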