lucidrains/mmdit
PyTorch implementation of the Multi-Modal Diffusion Transformer (MMDiT) layer from Stable Diffusion 3.

This repository provides a PyTorch implementation of the MMDiT (Multi-Modal Diffusion Transformer) architecture introduced in the Stable Diffusion 3 paper by Esser et al. It implements the joint attention block at the core of that architecture: text and image tokens are attended over together, so each modality can condition on the other during diffusion-based image generation. The implementation includes a single-block version for the two-modality case and a generalized version that supports more than two modalities (for example text, image, audio, and video). It also offers an adaptive attention variant in which a learned gate dynamically selects the weights to use.
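To make the joint attention idea concrete, here is a minimal sketch in plain PyTorch. It is not this repository's API: the class name `JointAttentionSketch`, its parameters, and the tensor sizes are all hypothetical and chosen only for illustration. The sketch assumes each modality has its own query/key/value and output projections, and that the projected tokens are concatenated along the sequence axis before a single attention pass. It leaves out the timestep conditioning, normalization, and feedforward layers that a full MMDiT block would contain.

```python
import torch
from torch import nn


class JointAttentionSketch(nn.Module):
    """Hypothetical illustration of joint text/image attention (not the repo's class)."""

    def __init__(self, dim_text, dim_image, dim_head=32, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        dim_inner = dim_head * heads

        # Each modality keeps its own projection weights.
        self.text_qkv = nn.Linear(dim_text, dim_inner * 3, bias=False)
        self.image_qkv = nn.Linear(dim_image, dim_inner * 3, bias=False)
        self.text_out = nn.Linear(dim_inner, dim_text, bias=False)
        self.image_out = nn.Linear(dim_inner, dim_image, bias=False)

    def _split_heads(self, t):
        # (batch, seq, heads * dim_head) -> (batch, heads, seq, dim_head)
        b, n, _ = t.shape
        return t.view(b, n, self.heads, -1).transpose(1, 2)

    def forward(self, text_tokens, image_tokens):
        n_text = text_tokens.shape[1]

        tq, tk, tv = self.text_qkv(text_tokens).chunk(3, dim=-1)
        iq, ik, iv = self.image_qkv(image_tokens).chunk(3, dim=-1)

        # Concatenate both modalities along the sequence axis so that
        # every token can attend to every other token, text or image.
        q = self._split_heads(torch.cat((tq, iq), dim=1))
        k = self._split_heads(torch.cat((tk, ik), dim=1))
        v = self._split_heads(torch.cat((tv, iv), dim=1))

        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = attn @ v  # (batch, heads, seq, dim_head)

        # Merge heads, then split the sequence back into the two modalities.
        b, h, n, d = out.shape
        out = out.transpose(1, 2).reshape(b, n, h * d)
        text_part, image_part = out[:, :n_text], out[:, n_text:]

        return self.text_out(text_part), self.image_out(image_part)


# Example sizes are arbitrary: 16 text tokens of width 64, 49 image tokens of width 128.
text = torch.randn(2, 16, 64)
image = torch.randn(2, 49, 128)

block = JointAttentionSketch(dim_text=64, dim_image=128)
text_result, image_result = block(text, image)
```

Each output keeps the sequence length and feature width of its own input, so a block like this can be stacked. Under the same assumptions, extending the idea to more than two modalities would amount to concatenating additional projected token sequences before the attention step and splitting them apart again afterwards.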