
microsoft/sarathi-serve

Microsoft's Sarathi-Serve is a research LLM serving framework optimized for the throughput-latency tradeoff, originally forked from vLLM.

506 stars Python Inference · Serving
Velocity (7d): +0.5 ★ / day · Trend: steady

Sarathi-Serve is a high-throughput, low-latency serving engine for large language models. The project is a research prototype that originated as a fork of vLLM and was adapted specifically to tame the throughput-latency tradeoff in LLM inference. It targets H100 and A100 GPUs and was developed alongside an academic paper published at the 2024 USENIX OSDI conference.
