lightseekorg/tokenspeed

Another LLM engine claims the speed crown, but this one ships

TokenSpeed wants to beat TensorRT-LLM on throughput without making you write manual parallelism code.

1.4k stars · Python · Inference · Serving · Agents
Velocity (7d): +42 ★/day · Trend: steady

What it does

TokenSpeed is a GPU inference engine for large language models, pitched squarely at “agentic workloads”: multi-step tool calls, long contexts, high concurrency. It promises TensorRT-LLM-level throughput with a vLLM-style developer experience. The stack covers modeling, scheduling, kernels, and an AsyncLLM server entrypoint.

The interesting bit

The parallelism logic is the unusual angle. You annotate module boundaries; a static compiler generates the collective communication. No hand-written NCCL spaghetti. The scheduler is also deliberately over-engineered in a good way: request lifecycle, KV cache ownership, and overlap timing are encoded as a finite-state machine, with safe KV reuse enforced by the C++ type system at compile time. That is the kind of boring part that prevents 3 AM production fires.
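To make the finite-state-machine idea concrete, here is a minimal sketch in Python. It is not TokenSpeed's scheduler: the state names, the `KVPool` and `Request` classes, and the transition table are all hypothetical, and the checks happen at runtime, whereas the project is described as enforcing KV ownership at compile time in C++. The point is only to show how tying KV block ownership to explicit lifecycle states rules out reuse of blocks that a live request still holds.

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()    # queued, owns no KV blocks
    PREFILL = auto()    # prompt being processed, owns KV blocks
    DECODE = auto()     # generating tokens, owns KV blocks
    FINISHED = auto()   # done, KV blocks released


# Legal lifecycle transitions (hypothetical); anything else is rejected.
TRANSITIONS = {
    State.WAITING: {State.PREFILL},
    State.PREFILL: {State.DECODE, State.FINISHED},
    State.DECODE: {State.FINISHED},
    State.FINISHED: set(),
}


class KVPool:
    """Toy block allocator that records which request owns each block."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owner = {}

    def acquire(self, req_id, n):
        if n > len(self.free):
            raise RuntimeError("out of KV blocks")
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.owner[b] = req_id
        return blocks

    def release(self, req_id, blocks):
        for b in blocks:
            if self.owner.get(b) != req_id:
                raise RuntimeError(f"block {b} is not owned by {req_id}")
            del self.owner[b]
            self.free.append(b)


class Request:
    def __init__(self, req_id, pool):
        self.req_id = req_id
        self.pool = pool
        self.state = State.WAITING
        self.blocks = []

    def advance(self, new_state, blocks_needed=0):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        if new_state is State.PREFILL:
            # Blocks are acquired only on entry to PREFILL.
            self.blocks = self.pool.acquire(self.req_id, blocks_needed)
        elif new_state is State.FINISHED:
            # Blocks return to the pool only on entry to FINISHED.
            self.pool.release(self.req_id, self.blocks)
            self.blocks = []
        self.state = new_state


pool = KVPool(num_blocks=4)
a = Request("a", pool)
a.advance(State.PREFILL, blocks_needed=3)
a.advance(State.DECODE)
a.advance(State.FINISHED)

# Request b can take all four blocks only because a has reached FINISHED.
b = Request("b", pool)
b.advance(State.PREFILL, blocks_needed=4)
```

Because blocks change hands only inside `advance`, there is a single place to audit for reuse bugs. A compile-time version would go further and make the illegal transitions fail to build rather than raise.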

Key highlights

  • Static compiler generates distributed communication from placement annotations
  • C++ control plane + Python execution plane; scheduler uses FSM semantics
  • Pluggable kernel registry with a fast MLA (Multi-head Latent Attention) implementation on Blackwell
  • SMG-integrated AsyncLLM entrypoint for low-overhead CPU request handling
  • Published benchmark: 580 TPS on Qwen3.5-397B-A17B for agentic workloads
  • Explicitly targets Kimi K2.5, Qwen, DeepSeek, MiniMax, Nemotron families
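The first highlight, generating communication from placement annotations, can be illustrated with a toy planner. This is a sketch under stated assumptions, not TokenSpeed's compiler or API: the layout labels, the `Module` dataclass, and the `plan` function are hypothetical. It follows the familiar tensor-parallel pattern in which a column-parallel layer produces sharded output and a row-parallel layer produces partial sums that must be reduced before a replicated consumer can read them.

```python
from dataclasses import dataclass

# Hypothetical layout labels for a tensor across a tensor-parallel group.
REPLICATED = "replicated"   # every rank holds the full tensor
SHARDED = "sharded"         # each rank holds one slice
PARTIAL = "partial"         # each rank holds a partial sum of the full tensor

# (layout produced, layout required) -> operation to insert at the boundary.
BOUNDARY_OPS = {
    (PARTIAL, REPLICATED): "all_reduce",
    (PARTIAL, SHARDED): "reduce_scatter",
    (SHARDED, REPLICATED): "all_gather",
    (REPLICATED, SHARDED): "local_slice",  # no communication needed
}


@dataclass(frozen=True)
class Module:
    name: str
    expects: str    # layout required of the input
    produces: str   # layout of the output


def plan(modules, input_layout=REPLICATED):
    """Walk the module chain once and emit the op needed at each boundary."""
    schedule = []
    layout = input_layout
    for m in modules:
        if layout != m.expects:
            op = BOUNDARY_OPS.get((layout, m.expects))
            if op is None:
                raise ValueError(
                    f"cannot convert {layout} to {m.expects} before {m.name}"
                )
            schedule.append(op)
        schedule.append(m.name)
        layout = m.produces
    return schedule, layout


# A tensor-parallel MLP followed by a norm that needs the full activation.
mlp = [
    Module("up_proj", expects=REPLICATED, produces=SHARDED),    # column-parallel
    Module("down_proj", expects=SHARDED, produces=PARTIAL),     # row-parallel
    Module("norm", expects=REPLICATED, produces=REPLICATED),
]

schedule, final_layout = plan(mlp)
print(schedule)  # ['up_proj', 'down_proj', 'all_reduce', 'norm']
```

The model author only declares what each module expects and produces; the single `all_reduce` falls out of the mismatch between `down_proj` and `norm`. A real static compiler would also have to handle mesh dimensions, overlap, and fusion, none of which this sketch attempts.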

Caveats

  • Preview release only; README warns “do not use for production deployments”
  • Several major PRs still in progress; model coverage and runtime features (prefill/decode disaggregation (PD), expert-parallel load balancing (EPLB), KV store, VLM support, metrics) are pending merge
  • Currently optimized for B200/Blackwell; Hopper and MI350 support is on the roadmap, not shipped

Verdict

Worth watching if you run high-concurrency agentic inference on NVIDIA Blackwell and are tired of hand-tuning distributed parallelism. Skip it if you need production stability today or AMD/older GPU support.
