Is Fast-dLLM open source?

Yes — NVlabs/Fast-dLLM is open source, released under the Apache-2.0 license.

What language is Fast-dLLM written in?

NVlabs/Fast-dLLM is primarily written in Python.

How popular is Fast-dLLM?

NVlabs/Fast-dLLM has 1.1k stars on GitHub.

Where can I find Fast-dLLM?

NVlabs/Fast-dLLM is on GitHub at https://github.com/NVlabs/Fast-dLLM.


NVlabs/Fast-dLLM

Block diffusion speeds up text, vision, and driving models

A family of acceleration techniques that makes diffusion-based language, vision, and driving models generate faster via KV caches, block diffusion, and speculative decoding.

★1.1k stars Python Inference · Serving Language Models


What it does

Fast-dLLM is a collection of inference acceleration techniques for diffusion-based transformers, covering plain text (Dream, LLaDA, Qwen2.5), vision-language (Qwen2.5-VL), and end-to-end autonomous driving on Waymo. The repository ships code and model checkpoints for four variants: a training-free accelerator that adds KV caching and parallel decoding to existing diffusion LLMs; a block-diffusion text model with hierarchical caching; a direct-conversion block-diffusion VLM; and a section-aware structured diffusion VLA that outputs driving plans. Each variant lives in its own subdirectory with dedicated evaluation scripts and pre-trained weights on Hugging Face.
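To make the idea of block-wise parallel decoding concrete, here is a toy sketch in plain Python. It is not code from the repository: `toy_model`, `decode_block`, `generate`, and the confidence-threshold acceptance rule are all illustrative assumptions, and the "model" just returns random confidences. The point is only to show why committing several positions per step needs fewer model calls than committing one token at a time.

```python
import random

MASK = None  # placeholder for a position that has not been decoded yet


def toy_model(tokens, rng):
    """Hypothetical stand-in for a diffusion LM.

    Returns {position: (token, confidence)} for every masked position.
    """
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(tokens)
        if tok is MASK
    }


def decode_block(block_len, threshold, rng):
    """Fill one block, committing every proposal that clears the threshold."""
    tokens = [MASK] * block_len
    steps = 0
    while MASK in tokens:
        proposals = toy_model(tokens, rng)
        accepted = [i for i, (_, conf) in proposals.items() if conf >= threshold]
        if not accepted:
            # Commit the single most confident slot so every step makes progress.
            accepted = [max(proposals, key=lambda i: proposals[i][1])]
        for i in accepted:
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps


def generate(num_blocks, block_len, threshold, seed=0):
    """Decode blocks left to right and count model calls.

    Finished blocks are never revisited, which is the property that would
    let a real system cache their keys and values.
    """
    rng = random.Random(seed)
    output, total_steps = [], 0
    for _ in range(num_blocks):
        block, steps = decode_block(block_len, threshold, rng)
        output.extend(block)
        total_steps += steps
    return output, total_steps
```

With a threshold of 0.0 every position is accepted immediately, so each block takes one step; with a threshold above 1.0 nothing is ever accepted and the fallback commits one token per step, which matches sequential decoding. Real thresholds sit between those extremes and trade speed against quality.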

The interesting bit

The project treats autoregressive and diffusion models as convertible formats rather than opposing religions: the VLM variant directly converts an autoregressive Qwen2.5-VL into a block-diffusion model, and the autonomous-driving VLA uses scaffold speculative decoding to hit over 200 TPS on a single H100 while claiming SOTA ADE and RFS metrics. It is a rare case where a single repository ships everything from a chatbot web demo to a Waymo end-to-end driving evaluator.
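For readers unfamiliar with speculative decoding, the sketch below shows the generic greedy draft-and-verify loop. It is explicitly not the scaffold variant used by the driving model, whose details are not described here; `speculative_decode`, `target_next`, and `draft_next` are hypothetical names, and both "models" are plain functions from a token list to the next token.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Generic greedy speculative decoding (illustrative, not the repo's method).

    Returns the generated sequence and the number of verification rounds.
    The output is identical to greedy decoding with target_next alone.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft phase: the cheap model proposes draft_len tokens in a row.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # Verify phase: a real system scores all drafted positions in one
        # batched pass of the target model; here that pass is simulated
        # with per-position calls and counted as a single round.
        rounds += 1
        accepted = []
        for proposed in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)  # keep the target's token either way
            if correct != proposed:
                break  # first mismatch ends the accepted prefix
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds
```

When the draft always agrees with the target, each round yields `draft_len + 1` tokens; when it never agrees, each round yields one token and nothing is gained. Any speedup therefore depends on how often the draft is right.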

Key highlights

v1 accelerates existing diffusion LLMs training-free with KV cache and parallel decoding.
v2 introduces block diffusion with fine-tuning and hierarchical caching for Qwen2.5.
Fast-dVLM reports up to 6.18× speedup over the autoregressive baseline across 11 benchmarks while matching quality.
Fast-dDrive claims up to 12× speedup over the AR baseline with SGLang and runs at over 200 TPS on a single H100.
Accepted at ICLR 2026; includes an online demo and Gradio chatbot interfaces.

Caveats

vLLM support is still pending (marked 🚀 in the TODO list).
The training code relies on a vendored LMFlow fork under third_party/, so you are signing up for a custom dependency tree.

Verdict

If you are working with diffusion transformers and need faster inference in text, vision, or driving domains, this repo gives you concrete code and checkpoints to test. If you are looking for a drop-in vLLM integration or a fully packaged training framework, wait for the next release.
