alibaba/rtp-llm

Alibaba's inference engine that runs Taobao's LLM traffic

A production-hardened serving stack built on FasterTransformer, handling search, chat, and multimodal workloads across Alibaba's commerce empire.

★ 1.3k stars · CUDA · Inference · Serving

What it does

RTP-LLM is Alibaba’s internal LLM inference engine, now open-sourced, that powers real-world services across Taobao, Tmall, Lazada, Ele.me, and Cainiao. It handles the full serving stack: model loading, quantization, batching, prefix caching, speculative decoding, and multi-GPU tensor parallelism. The project is a sub-project of Havenask, Alibaba’s search engine framework.
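
To make the prefix caching idea concrete, here is a minimal Python sketch of a block-level prefix cache. It is an illustration only, not RTP-LLM's implementation: the `PrefixCache` class, the block size of 4 tokens, and the string placeholders standing in for KV-cache handles are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prefix cache (hypothetical): maps block-aligned token prefixes
    to a placeholder standing in for cached KV state."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def _aligned_ends(self, tokens: List[int]) -> range:
        # End offsets of every full block in the sequence.
        full = len(tokens) - len(tokens) % self.block_size
        return range(self.block_size, full + 1, self.block_size)

    def insert(self, tokens: List[int]) -> None:
        # Register every block-aligned prefix of a finished request.
        for end in self._aligned_ends(tokens):
            self._blocks.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")

    def match(self, tokens: List[int]) -> int:
        # Number of leading tokens whose KV state could be reused.
        matched = 0
        for end in self._aligned_ends(tokens):
            if tuple(tokens[:end]) not in self._blocks:
                break
            matched = end
        return matched


cache = PrefixCache(block_size=4)
first_turn = list(range(10))              # token ids of an earlier turn
cache.insert(first_turn)
second_turn = first_turn + [100, 101, 102]
print(cache.match(second_turn))           # 8 tokens can skip prefill
```

In a multi-turn chat, the second request repeats the first turn's tokens, so only the unmatched tail needs a fresh prefill pass. A real engine would store GPU memory blocks rather than strings and would need an eviction policy, both of which this sketch leaves out.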

The interesting bit

The engine descends from NVIDIA’s FasterTransformer, borrows kernels from TensorRT-LLM, and draws on ideas from vLLM. Rather than reinventing the wheel, it battle-tests and extends proven designs at massive scale. The README notes a 2024 rewrite of the scheduling and batching framework in C++ with “complete GPU memory management,” which suggests the team hit scaling limits in the original architecture.

Key highlights

Production footprint: powers Taobao Wenwen, OpenSearch LLM Q&A, and Alibaba’s international AI platform Aidge
Quantization buffet: WeightOnly INT8 (auto-quantized at load), INT4 via GPTQ/AWQ, plus adaptive KVCache compression
Hardware spread: CUDA-first with V100-specific optimizations, plus emerging ARM CPU (Yitian), Intel CPU, and AMD ROCm support
Multimodal and multi-tenant: handles image+text inputs and serves multiple LoRA adapters from one model instance
Caching and loading smarts: contextual prefix cache for multi-turn chat, system prompt cache, and loading of pruned, irregularly shaped models
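
As a rough illustration of what weight-only INT8 quantization at load time means, the Python sketch below quantizes each weight row to 8-bit integers with one scale per row. This is a simplified assumption, not RTP-LLM's code: the real engine runs this in GPU kernels, and the symmetric per-row scheme and the function names `quantize_rows` and `dequantize_rows` are chosen here only for the example.

```python
from typing import List, Tuple


def quantize_rows(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row INT8 quantization (hypothetical scheme):
    each weight w is stored as an integer q in [-127, 127] with w ≈ q * scale."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights; activations stay in floating point."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.03]]
q_rows, scales = quantize_rows(weights)
restored = dequantize_rows(q_rows, scales)
# Each restored weight is within half a quantization step of the original.
```

Only the weights are compressed, which is why the scheme is called weight-only: it roughly halves weight memory relative to FP16 while the matrix multiply still accumulates in floating point. GPTQ and AWQ, mentioned in the same highlight, use calibration data to choose INT4 values and are not covered by this sketch.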

Caveats

The “multi-hardware support” has been “coming soon” since June 2024; only CUDA and Yitian ARM are confirmed shipping
Benchmark numbers are referenced but not actually included in the README — you’ll need to follow the external docs link

Verdict

Worth studying if you’re building high-throughput LLM serving for e-commerce or search-scale traffic, especially on NVIDIA stacks. Probably overkill if you’re running a single-model hobby deployment — vLLM or TGI will get you there faster.

Frequently asked

What is alibaba/rtp-llm?: A production-hardened serving stack built on FasterTransformer, handling search, chat, and multimodal workloads across Alibaba's commerce empire.
Is rtp-llm open source?: Yes — alibaba/rtp-llm is open source, released under the Apache-2.0 license.
What language is rtp-llm written in?: alibaba/rtp-llm is primarily written in CUDA.
How popular is rtp-llm?: alibaba/rtp-llm has 1.3k stars on GitHub.
Where can I find rtp-llm?: alibaba/rtp-llm is on GitHub at https://github.com/alibaba/rtp-llm.