z-lab/dflash

A diffusion model that drafts tokens in blocks, not one-by-one

DFlash speeds up LLM inference by using a lightweight block diffusion model for speculative decoding, letting the main model verify chunks in parallel rather than waiting on autoregressive guesses.

Velocity (7d): +32 ★ / day · Trend: steady

What it does

DFlash trains small “draft” models that predict multiple future tokens simultaneously using block diffusion, then feeds those blocks to a full-size target model for verification. If the draft is good, the target model accepts a chunk at once; if not, it falls back to normal generation. The project provides pre-trained draft models for Gemma-4, Qwen3.x, Kimi, MiniMax, and others, plus integration code for vLLM, SGLang, Transformers, and MLX.
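The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration only: it uses the standard greedy acceptance rule from speculative decoding (keep the longest draft prefix the target agrees with, then take one token from the target), which may differ from DFlash's actual rule. The names `toy_target`, `toy_draft`, `verify_block`, and `generate` are hypothetical and not part of the DFlash codebase, and the "models" are simple counting functions rather than neural networks.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]              # target: context -> next token
DraftBlock = Callable[[List[Token], int], List[Token]]  # draft: context, k -> k tokens


def verify_block(target_next: NextToken, context: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Return the accepted draft prefix plus one token chosen by the target.

    A real engine scores every draft position in one batched forward pass;
    this toy calls the target once per position for readability.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # whole block accepted: bonus token
    return accepted


def generate(target_next: NextToken, draft_block: DraftBlock, prompt: List[Token],
             max_new: int, block_size: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verification steps."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        draft = draft_block(out, block_size)
        out.extend(verify_block(target_next, out, draft))
        steps += 1
    return out[:len(prompt) + max_new], steps


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical draft model: same counting rule, but wrong whenever 7 is due."""
    block, last = [], ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        last = 0 if nxt == 7 else nxt  # deliberate drafting error
        block.append(last)
    return block


tokens, steps = generate(toy_target, toy_draft, prompt=[0], max_new=12, block_size=4)
```

Because every emitted token is either confirmed or supplied by the target, the result matches what the target would have produced on its own; the draft only changes how many verification steps are needed.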

The interesting bit

Most speculative decoders use tiny autoregressive models that still generate one token at a time. DFlash replaces that with a diffusion process that drafts a whole block of tokens in parallel. The trade-off is a more complex training recipe (the authors note they will open-source it “soon”), but the payoff is higher-quality drafts at larger step sizes.
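To see why draft quality matters more as blocks grow, here is a rough back-of-the-envelope estimate. It uses the usual expected-length formula from speculative decoding analysis, not anything published by the DFlash authors, and it assumes each drafted token is accepted independently with the same probability, which is a simplification. The function name `expected_tokens_per_step` and the 0.8 acceptance rate are illustrative assumptions.

```python
def expected_tokens_per_step(accept_rate: float, block_size: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes each of the block_size drafted tokens is accepted independently
    with probability accept_rate, and that the target always contributes one
    token of its own (the correction or the bonus token).
    Closed form of the geometric sum: (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if accept_rate == 1.0:
        return float(block_size + 1)  # limit of the closed form as a -> 1
    return (1.0 - accept_rate ** (block_size + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rate of 0.8 across several block sizes.
estimates = {k: expected_tokens_per_step(0.8, k) for k in (2, 4, 8, 16)}
```

Under this model the gain from a larger block flattens quickly unless the acceptance rate is high, which is why a drafter that stays accurate deep into the block is the part that pays off.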

Key highlights

  • Pre-trained draft models for 20+ target models, from 4B to 120B parameters
  • Native support in vLLM (v0.20.1+), SGLang, Hugging Face Transformers, and MLX for Apple Silicon
  • Benchmark harness included for gsm8k, math500, humaneval, mbpp, and mt-bench
  • Docker image provided for Gemma-4 with a patched vLLM build
  • Community implementations acknowledged; official MLX version tested on M5 Pro

Caveats

  • Gemma-4 support currently requires a temporary vLLM fork or Docker image, not the standard release
  • Transformers backend only works with Qwen3 and LLaMA-3.1 models
  • Training recipe is not yet released, so you cannot currently train custom draft models
  • Several listed models (DeepSeek-V4, GLM-5.1) are marked “Coming soon”

Verdict

Worth a look if you run inference at scale and want to squeeze latency out of supported models. Skip it if you need to train your own draft model today, or if your target model isn’t on the compatibility list yet.
