thu-ml/unidiffuser
A unified diffusion framework that performs image generation, text generation, text-to-image, and image-to-text synthesis in a single transformer model.

This repository implements a multi-modal diffusion model that fits the marginal, conditional, and joint distributions of image-text data with one network. Instead of perturbing a single modality, it adds noise to every modality, each with its own timestep, and a transformer backbone predicts the noise for all modalities at once. The model is trained on large-scale paired image-text data and switches between generation tasks by choosing the timestep fed in for each modality, with no architectural changes.
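
To make the idea of selecting a task through timesteps concrete, here is a minimal sketch of how the per-modality timesteps might be chosen at sampling step t. It is not taken from the repository's code: the function name `timesteps_for_task`, the task labels, and `T_MAX = 1000` are all assumptions made for illustration. The convention shown is that a timestep of 0 marks a modality as clean (used as the condition), the maximum timestep marks it as pure noise (effectively ignored), and equal timesteps denoise both modalities jointly.

```python
from typing import Tuple

# Assumed number of diffusion steps; an illustrative value, not read from the repo.
T_MAX = 1000


def timesteps_for_task(task: str, t: int, t_max: int = T_MAX) -> Tuple[int, int]:
    """Return (t_image, t_text) to feed the noise-prediction network at step t.

    Hypothetical helper: the task names below are labels chosen for this sketch.
    """
    if not 0 <= t <= t_max:
        raise ValueError(f"t must be in [0, {t_max}], got {t}")
    if task == "joint":
        # Both modalities are denoised together at the same noise level.
        return t, t
    if task == "text_to_image":
        # Text is clean (timestep 0) and acts as the condition.
        return t, 0
    if task == "image_to_text":
        # Image is clean (timestep 0) and acts as the condition.
        return 0, t
    if task == "image":
        # Text is held at maximum noise, so only the image marginal is modeled.
        return t, t_max
    if task == "text":
        # Image is held at maximum noise, so only the text marginal is modeled.
        return t_max, t
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    for name in ("joint", "text_to_image", "image_to_text", "image", "text"):
        print(name, timesteps_for_task(name, 500))
```

In a sampler built along these lines, the same network call would be reused for every task and only the pair returned by this helper would change, which is what lets one set of weights cover all five settings.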