
alibaba/rtp-llm

Alibaba's GPU-accelerated LLM inference engine for production serving at scale.

1.2k stars · CUDA · Inference · Serving
Velocity (7d): +1.3 ★ / day · Trend: steady

RTP-LLM is a large language model (LLM) inference acceleration engine developed by Alibaba’s Foundation Model Inference Team. It provides high-performance LLM serving through Prefill/Decode separation, GPU memory management, and CUDA attention kernel optimizations. The engine supports distributed inference, is used across multiple business units, and is designed for multi-hardware deployments, including AMD ROCm, Intel CPU, and ARM CPU backends.
