microsoft/sarathi-serve
Microsoft's Sarathi-Serve is a research LLM serving framework optimized for the throughput-latency tradeoff, originally forked from vLLM.

Velocity (7d): +0.5 ★/day · Trend: steady
Sarathi-Serve is a high-throughput, low-latency serving engine for large language models. The project is a research prototype that originated as a fork of vLLM and was adapted specifically to tame the throughput-latency tradeoff in LLM inference. It targets H100 and A100 GPUs and was developed as part of the research behind a paper published at USENIX OSDI 2024.