Another LLM engine claims the speed crown, but this one ships
TokenSpeed wants to beat TensorRT-LLM on throughput without making you write manual parallelism code.

What it does
TokenSpeed is a GPU inference engine for large language models, pitched squarely at “agentic workloads”: multi-step tool calls, long contexts, and high concurrency. It promises TensorRT-LLM-level throughput with a vLLM-style developer experience. The stack covers modeling, scheduling, kernels, and an AsyncLLM server entrypoint.
The interesting bit
The parallelism logic is the unusual angle. You annotate module boundaries; a static compiler generates the collective communication. No hand-written NCCL spaghetti. The scheduler is also deliberately over-engineered in a good way: request lifecycle, KV cache ownership, and overlap timing are encoded as a finite-state machine, with safe KV reuse enforced by the C++ type system at compile time. That is the kind of boring part that prevents 3 AM production fires.
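To make the finite-state-machine idea concrete, here is a minimal sketch of a request lifecycle that owns KV cache blocks only in certain states. It is an illustration, not TokenSpeed's code: the state names, the `Request` class, and the block pool are all hypothetical, and the checks here run at runtime in Python, whereas the project describes enforcing them at compile time in C++.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical lifecycle states, chosen for illustration only.
    WAITING = auto()   # queued, owns no KV blocks
    PREFILL = auto()   # owns KV blocks, processing the prompt
    DECODE = auto()    # owns KV blocks, generating tokens
    FINISHED = auto()  # KV blocks returned to the pool


# Legal transitions; anything absent from this table is rejected.
TRANSITIONS = {
    State.WAITING: {State.PREFILL},
    State.PREFILL: {State.DECODE, State.FINISHED},
    State.DECODE: {State.FINISHED},
    State.FINISHED: set(),
}


class Request:
    def __init__(self, request_id, free_blocks, blocks_needed):
        self.request_id = request_id
        self.state = State.WAITING
        self.free_blocks = free_blocks      # shared pool: list of block ids
        self.blocks_needed = blocks_needed
        self.owned_blocks = []

    def advance(self, new_state):
        """Move to new_state, acquiring or releasing KV blocks as needed."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        if new_state is State.PREFILL:
            if len(self.free_blocks) < self.blocks_needed:
                raise RuntimeError("not enough free KV blocks")
            # Ownership moves from the pool to this request.
            self.owned_blocks = [
                self.free_blocks.pop() for _ in range(self.blocks_needed)
            ]
        elif new_state is State.FINISHED:
            # Blocks become reusable only once the request is finished.
            self.free_blocks.extend(self.owned_blocks)
            self.owned_blocks = []
        self.state = new_state
```

Because every transition goes through one table, a block can never be handed to a second request while the first is still in `PREFILL` or `DECODE`. A compile-time version would express the same rule in types rather than in a runtime check.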
Key highlights
- Static compiler generates distributed communication from placement annotations
- C++ control plane + Python execution plane; scheduler uses FSM semantics
- Pluggable kernel registry with a fast MLA (Multi-head Latent Attention) implementation on Blackwell
- SMG-integrated AsyncLLM entrypoint for low-overhead CPU request handling
- Published benchmark: 580 TPS (tokens per second) on Qwen3.5-397B-A17B for agentic workloads
- Explicitly targets Kimi K2.5, Qwen, DeepSeek, MiniMax, Nemotron families
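
The first highlight, deriving communication from placement annotations, can be pictured with a small sketch. Everything here is an assumption for illustration: the `Placement` names, the lookup table, and `plan_collectives` are hypothetical and are not TokenSpeed's API. The sketch only shows the general technique of comparing what one module produces with what the next expects and picking the matching collective.

```python
from enum import Enum


class Placement(Enum):
    # Hypothetical annotation vocabulary.
    REPLICATED = "replicated"  # every rank holds the full tensor
    SHARDED = "sharded"        # every rank holds one slice
    PARTIAL = "partial"        # every rank holds a partial sum


# Which operation converts a produced placement into an expected one.
COLLECTIVE_FOR = {
    (Placement.PARTIAL, Placement.REPLICATED): "all_reduce",
    (Placement.PARTIAL, Placement.SHARDED): "reduce_scatter",
    (Placement.SHARDED, Placement.REPLICATED): "all_gather",
    (Placement.REPLICATED, Placement.SHARDED): "local_slice",  # no communication
}


def plan_collectives(modules):
    """modules: list of (name, input_placement, output_placement) in order.

    Returns a list of (producer, consumer, operation) for every boundary
    where the produced placement differs from the expected one.
    """
    plan = []
    for (prev_name, _, produced), (next_name, expected, _) in zip(
        modules, modules[1:]
    ):
        if produced is expected:
            continue  # placements already agree, nothing to insert
        op = COLLECTIVE_FOR.get((produced, expected))
        if op is None:
            raise ValueError(
                f"no conversion from {produced.value} to {expected.value} "
                f"between {prev_name} and {next_name}"
            )
        plan.append((prev_name, next_name, op))
    return plan


# A tensor-parallel MLP: column-parallel, then row-parallel, then a norm.
mlp = [
    ("up_proj", Placement.REPLICATED, Placement.SHARDED),
    ("down_proj", Placement.SHARDED, Placement.PARTIAL),
    ("norm", Placement.REPLICATED, Placement.REPLICATED),
]
print(plan_collectives(mlp))
# [('down_proj', 'norm', 'all_reduce')]
```

The point of the exercise is that the all-reduce falls out of the annotations; nobody wrote the call by hand.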
Caveats
- Preview release only; README warns “do not use for production deployments”
- Several major PRs are still in progress; model coverage and runtime features (prefill/decode disaggregation (PD), expert-parallel load balancing (EPLB), KV store, vision-language models (VLM), metrics) are pending merge
- Currently optimized for B200/Blackwell; Hopper and MI350 support is on the roadmap, not shipped
Verdict
Worth watching if you run high-concurrency agentic inference on NVIDIA Blackwell and are tired of hand-tuning distributed parallelism. Skip it if you need production stability today or AMD/older GPU support.