sgl-project/sglang

The inference engine running on 400,000 GPUs wants your attention

SGLang is an open-source serving framework that went from research project to production backbone for LLMs, diffusion models, and reinforcement learning.

Velocity (7d): +33 ★ / day · Trend: steady

What it does

SGLang serves large language and multimodal models with a focus on high throughput and low latency. It covers the standard serving features (continuous batching, paged attention, speculative decoding, quantization, and tensor/pipeline/expert parallelism) and adds a frontend API for structured generation. It also runs diffusion models and doubles as a rollout backend for RL training.

The interesting bit

The project started at LMSYS and now claims deployment on over 400,000 GPUs worldwide, including at xAI, LinkedIn, and Cursor. Its “RadixAttention” prefix caching and zero-overhead CPU scheduler are pitched as differentiators, but the real signal is the hardware sprawl: from NVIDIA GB300 down to consumer RTX 5090s, plus AMD MI300, Intel Xeon, Google TPU, and even Ascend NPUs. That breadth is unusual for a single inference stack.
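To make the prefix-caching idea concrete, here is a toy sketch of the general technique: keep a tree of previously served token sequences so a new request only needs fresh computation for the part that differs. This is not SGLang's implementation. It uses a plain per-token trie rather than a compressed radix tree, stores no actual KV tensors, has no eviction, and the class name, method names, and token IDs are all hypothetical.

```python
class PrefixCache:
    """Toy token-level trie; each node stands in for one token's cached KV state."""

    def __init__(self):
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a served sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]           # hypothetical token IDs
cache.insert(system_prompt + [5, 6])      # first request: nothing to reuse
reused = cache.match_prefix(system_prompt + [9, 9, 9])
print(f"tokens reused from cache: {reused}")
```

In this sketch the second request shares only the system prompt with the first, so those tokens are the reusable part and the remainder would go through prefill as usual.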

Key highlights

  • Day-0 support for recent heavy hitters: DeepSeek V3/R1, DeepSeek-V3.2 with sparse attention, Qwen, Llama, GPT-oss, MiniMax M2, MiMo-V2-Flash
  • Native TPU backend via SGLang-Jax; AMD and Intel optimizations are first-class, not afterthoughts
  • Prefill-decode disaggregation and large-scale expert parallelism benchmarked on 96 H100s and GB200 NVL72 racks
  • Structured outputs via compressed finite state machines (claimed 3× faster JSON decoding in early 2024)
  • Integrated with post-training frameworks: verl, AReaL, Miles, slime, Tunix
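The structured-output bullet above is easier to picture with a toy example. The rough idea behind constraining generation with a finite state machine is that wherever the grammar allows exactly one next symbol, the decoder can emit it without asking the model. The sketch below works on characters, uses a scripted stand-in for the model, and follows forced edges one at a time; a real system works on tokens and would collapse such chains into a single step. All names here are hypothetical and none of this is SGLang's API.

```python
import string


def build_fsm(prefix, suffix):
    """Char-level DFA for: literal prefix, zero or more lowercase letters, literal suffix."""
    fsm, state = {}, 0
    for ch in prefix:
        fsm[state] = {ch: state + 1}       # forced edge: only one legal character
        state += 1
    fsm[state] = {c: state for c in string.ascii_lowercase}  # free-text loop
    for ch in suffix:
        fsm.setdefault(state, {})[ch] = state + 1
        state += 1
    return fsm, state                      # `state` is now the accepting state


def decode(fsm, final, choose):
    """Walk the FSM, calling `choose` (the stand-in model) only at real choices."""
    out, state, model_calls = [], 0, 0
    while state != final:
        edges = fsm[state]
        if len(edges) == 1:
            ch = next(iter(edges))         # deterministic: skip the model
        else:
            ch = choose(sorted(edges))     # model picks among allowed characters
            model_calls += 1
        out.append(ch)
        state = edges[ch]
    return "".join(out), model_calls


fsm, final = build_fsm('{"name": "', '"}')
script = iter('bob"')                      # scripted "model" output for the free slot
text, calls = decode(fsm, final, lambda allowed: next(script))
print(text, calls)
```

Here the braces, key, and punctuation come from the grammar alone, and the stand-in model is consulted only inside the free-text slot and to close it.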

Caveats

  • The README is heavy on name-dropping adopters and light on reproducible benchmarks you can run yourself; most performance claims link to LMSYS blog posts
  • Acknowledges borrowing design and code from vLLM, FlashInfer, Outlines, and others, so check how much overlap there is if you’re already invested in those stacks

Verdict

If you’re operating inference at scale or need a single backend that spans GPUs, TPUs, and RL rollouts, SGLang is now hard to ignore. For a single-GPU hobbyist, the complexity may outweigh the gains unless you need specific model or hardware support that vLLM or TGI lack.
