Gen-Verse/MMaDA
Multimodal diffusion language model that unifies textual reasoning, visual understanding, and text-to-image generation in a single 8B parameter architecture.

MMaDA is a family of multimodal diffusion foundation models that replaces autoregressive generation with diffusion-based token prediction across modalities. It uses a unified diffusion architecture with modality-agnostic design to handle text, images, and their combinations. The model incorporates mixed chain-of-thought reasoning and unified reinforcement learning training for improved reasoning capabilities across textual and visual tasks.