
sgl-project/mini-sglang

A compact ~5,000-line Python implementation of an LLM serving system with state-of-the-art inference optimizations.

4.4k stars · Python · Inference · Serving
Velocity (7d): +16 ★ / day · Trend: steady

Mini-SGLang is a reference implementation of SGLang’s LLM serving framework that provides a high-performance inference engine for large language models. Its optimizations include Radix Cache, which reuses KV cache across requests that share a prefix; Chunked Prefill, which reduces peak memory during long-context serving; Overlap Scheduling, which hides CPU overhead behind GPU computation; and Tensor Parallelism, which scales inference across multiple GPUs. It also integrates FlashAttention and FlashInfer kernels for efficient attention computation.
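To make two of these ideas concrete, here is a minimal sketch of prefix reuse and chunked prefill. It is not Mini-SGLang's code or API: `PrefixCache`, `TrieNode`, and `prefill_chunks` are hypothetical names, and the cache is a plain token-level trie rather than a true radix tree (which would compress single-child chains into one edge and track the associated KV cache memory).

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps the next token id to a child node. A real radix tree would
    # store whole token runs on each edge; this sketch stores one token.
    children: dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixCache:
    """Hypothetical token-level prefix index (illustration only)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        """Record a token sequence whose KV cache is assumed to be stored."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched


def prefill_chunks(start: int, end: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split the uncached token range [start, end) into bounded chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(i, min(i + chunk_size, end)) for i in range(start, end, chunk_size)]


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # an earlier request
request = [1, 2, 3, 9, 9, 9, 9, 9]         # shares the prefix [1, 2, 3]
cached = cache.match_prefix(request)       # tokens that can skip prefill
chunks = prefill_chunks(cached, len(request), chunk_size=2)
print(cached, chunks)
```

In this sketch only the tokens after the shared prefix need prefill, and they are processed in pieces of at most `chunk_size` tokens, which is what bounds the peak memory of a single prefill step.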
