vllm-project/production-stack
A Kubernetes-native reference stack for deploying and scaling vLLM LLM inference across distributed clusters.

Velocity · 7d
+4.7
★ / day
Trend
→steady
star history
The vLLM Production Stack provides a reference implementation for deploying vLLM inference engines at scale in production. It enables scaling from single instances to distributed Kubernetes deployments without changing application code. The stack includes web-based monitoring dashboards, request routing for load distribution, and KV cache offloading to optimize inference performance across cluster-wide deployments.