Diffusion meets language: a unified toolkit for non-autoregressive text generation
dLLM wraps training, inference, and evaluation recipes for diffusion language models into one reproducible codebase built on familiar Hugging Face tooling.

What it does
dLLM is a Python library that unifies the scattered world of diffusion language models (masked diffusion, block diffusion, edit flows) into a single training and evaluation pipeline. It builds on the transformers Trainer, supports LoRA, DeepSpeed, and FSDP out of the box, and plugs into lm-evaluation-harness for benchmarking. The repo ships ready-made recipes for models like LLaDA, Dream, and even a BERT-turned-chatbot, plus utilities to convert autoregressive checkpoints (Qwen, LLaMA, GPT-2) into diffusion variants.
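To make "masked diffusion" concrete: models in this family are typically trained by corrupting a sequence, replacing each token with a mask token with some probability t, and learning to recover the originals at the masked positions. The sketch below shows only that corruption step in plain Python. The names MASK, mask_tokens, and masked_positions are illustrative assumptions, not part of dLLM's API.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration only


def mask_tokens(tokens, t, rng):
    """Replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_positions(noisy):
    """Indices the model would be asked to reconstruct."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]


rng = random.Random(0)  # seeded so the example is repeatable
tokens = "diffusion models denoise all positions in parallel".split()

noisy = mask_tokens(tokens, t=0.5, rng=rng)
targets = masked_positions(noisy)

print(noisy)
print("positions to predict:", targets)
```

At t close to 0 almost nothing is hidden, and at t close to 1 nearly the whole sequence is masked, so sampling t per example exposes the model to every corruption level.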
The interesting bit
The project treats diffusion for text as an infrastructure problem, not just a research novelty. It includes GRPO reinforcement-learning training for reasoning tasks (GSM8K, MATH, Sudoku, Code) and Fast-dLLM inference acceleration with cache-aware decoding, which suggests the authors expect these models to actually be used, not just cited.
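For readers new to GRPO, its core idea is to score each sampled answer relative to the other answers drawn for the same prompt, rather than against a learned value function. Below is a minimal sketch of that group-relative advantage, assuming binary correctness rewards and the population standard deviation (implementations differ on this detail). The function name group_relative_advantages is hypothetical and is not taken from dLLM.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 marks a correct answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which every answer earns the same reward contributes no learning signal.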
Key highlights
- Training recipes for LLaDA, Dream, LLaDA2.x, BERT-Chat, and Edit Flows with insertion/deletion/substitution operations
- A2D pipeline converts any autoregressive model to masked or block diffusion; Tiny-A2D releases 0.5B/0.6B checkpoints
- Distributed training via Accelerate (DDP, ZeRO-1/2/3, FSDP) with optional 4-bit quantization and LoRA
- Evaluation through the lm-evaluation-harness submodule; Slurm cluster scripts included
- diffu-GRPO reinforcement learning for diffusion models on reasoning benchmarks
Caveats
- The README notes this is “primarily for educational purposes” and does not aim for exact reproduction of official models
- Setup requires manual CUDA/PyTorch alignment and submodule initialization for evaluation
- Several demo GIFs and assets are commented out in the source, suggesting documentation is still being polished
Verdict
Worth a look if you’re experimenting with non-autoregressive text generation or need a standardized baseline to compare diffusion architectures. Skip it if you want battle-tested, drop-in replacements for production autoregressive APIs.