InternLM/lmdeploy

A dual-engine LLM server that wants to outrun vLLM

LMDeploy is a compression and serving toolkit that claims up to 1.8× the throughput of vLLM by pairing a custom CUDA engine with a pure-Python fallback.

★ 8k stars | Python | Inference · Serving | Language Models

What it does

LMDeploy compresses, deploys, and serves large language and vision models through two separate inference backends. TurboMind targets raw performance with persistent batching, blocked KV caches, and custom kernels, while the PyTorchEngine offers a pure-Python environment for rapid experimentation. The toolkit also coordinates multi-model, multi-machine services and supports weight-only and KV-cache quantization.
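
As a rough illustration of the two-backend split, the sketch below selects an engine through LMDeploy's Python `pipeline` API, passing either a `TurbomindEngineConfig` or a `PytorchEngineConfig`. It assumes `lmdeploy` is installed and a supported GPU is available. The `pick_backend` and `build_pipeline` helpers are hypothetical names introduced here, not part of the library.

```python
def pick_backend(prefer_speed: bool) -> str:
    """Hypothetical helper: map a preference to one of the two engine names."""
    return "turbomind" if prefer_speed else "pytorch"


def build_pipeline(model: str, prefer_speed: bool = True, tp: int = 1):
    """Create an LMDeploy pipeline on the chosen backend.

    lmdeploy is imported lazily, so this module loads even where the
    package is not installed; the import only happens when this is called.
    """
    from lmdeploy import PytorchEngineConfig, TurbomindEngineConfig, pipeline

    if pick_backend(prefer_speed) == "turbomind":
        backend_config = TurbomindEngineConfig(tp=tp)  # performance-oriented engine
    else:
        backend_config = PytorchEngineConfig(tp=tp)  # pure-Python engine
    return pipeline(model, backend_config=backend_config)
```

A call such as `build_pipeline("internlm/internlm2_5-7b-chat")` followed by `pipe(["Hello"])` would then run inference; the model ID here is only a placeholder, and whether a given model works depends on which engine supports it.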

The interesting bit

The project positions itself directly against vLLM: its own materials cite up to 1.8× higher request throughput than vLLM and 4-bit inference that runs 2.4× faster than FP16. Rather than shipping a single engine, it splits into a speed-optimized TurboMind track and a developer-friendly PyTorchEngine track, each with different model coverage and data-type support.

Key highlights

Claims up to 1.8× higher request throughput than vLLM via TurboMind, using persistent batching (also known as continuous batching), a blocked KV cache, and custom CUDA kernels.
Supports both weight-only (AWQ) and online KV-cache quantization (int8/int4), with 4-bit inference reportedly 2.4× faster than FP16.
Two distinct engines: TurboMind for performance, PyTorchEngine for accessibility; model and dtype support differs between them.
Extensive model roster, including recent releases like DeepSeek-V3/R1, Qwen3.5, Llama4, and a wide array of VLMs.
Recent additions include MXFP4 support on NVIDIA GPUs (V100+), DeepSeek PD disaggregation via Mooncake/DLSlime, and Huawei Ascend compatibility.
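
To see why KV-cache quantization matters for serving capacity, the back-of-the-envelope sketch below computes how much cache one token occupies at fp16, int8, and int4. The model shape is hypothetical and chosen only for illustration, and the calculation ignores the small overhead of quantization scales and zero points.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bits):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    elements = 2 * num_layers * num_kv_heads * head_dim  # 2 = key + value
    return elements * bits / 8


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    size = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits)
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

For this shape the per-token footprint is 128 KiB at fp16, 64 KiB at int8, and 32 KiB at int4, so the same cache budget holds roughly two or four times as many tokens.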

Caveats

The visible benchmark chart is labeled v0.1.0-benchmark, so the age and current validity of those specific figures are unclear.
Because the two engines support different models and data types, you must consult a compatibility matrix before choosing a backend.
The project briefly hit a PyPI storage quota wall in early 2026; current wheels default to CUDA 12.8.
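
The compatibility check mentioned above can be sketched as a simple lookup. Every entry in the table below is a made-up placeholder, not real support data; in practice the table would be filled from the project's published supported-models list. The `SUPPORT` table and `backends_for` function are hypothetical names.

```python
# Placeholder entries only: these are NOT real support data.
# Fill the table from the project's published supported-models list.
SUPPORT = {
    ("example-model-a", "turbomind"): {"fp16", "int4-awq"},
    ("example-model-a", "pytorch"): {"fp16"},
    ("example-model-b", "pytorch"): {"fp16", "bf16"},
}


def backends_for(model: str, dtype: str) -> list[str]:
    """Return the engines whose entry for `model` lists `dtype`, sorted by name."""
    return sorted(
        engine
        for (name, engine), dtypes in SUPPORT.items()
        if name == model and dtype in dtypes
    )
```

An empty result, as for `backends_for("example-model-b", "int4-awq")`, signals that no engine in the table covers that combination and a different model or data type is needed.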

Verdict

Worth evaluating if you need quantized or multimodal inference at scale on NVIDIA or Ascend hardware. Look elsewhere if you prefer a single backend where every model behaves identically.

Frequently asked

What is InternLM/lmdeploy? LMDeploy is a compression and serving toolkit that claims up to 1.8× the throughput of vLLM by pairing a custom CUDA engine with a pure-Python fallback.
Is lmdeploy open source? Yes, InternLM/lmdeploy is open source, released under the Apache-2.0 license.
What language is lmdeploy written in? InternLM/lmdeploy is primarily written in Python.
How popular is lmdeploy? InternLM/lmdeploy has 8k stars on GitHub.
Where can I find lmdeploy? InternLM/lmdeploy is on GitHub at https://github.com/InternLM/lmdeploy.