baofff/U-ViT
A PyTorch implementation of U-ViT, a Vision Transformer backbone for diffusion models used in image generation tasks.

The repository provides the official implementation of U-ViT, a ViT-based architecture that replaces the CNN-based U-Net traditionally used in diffusion models. It treats all inputs, including the time step, the condition, and the noisy image patches, as tokens, and it adds long skip connections between shallow and deep layers. The model is evaluated on unconditional, class-conditional, and text-to-image generation, achieving an FID of 2.29 for class-conditional generation on ImageNet 256x256 and 5.48 for text-to-image generation on MS-COCO.
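The two design ideas described above, inputs as tokens and long skip connections, can be illustrated with a small dependency-free sketch. This is not the repository's code: the function names `token_count` and `skip_pairs` are hypothetical. The sketch also assumes square inputs, one time token plus one condition token, and an odd number of blocks in which the i-th shallow block is paired with its mirror-image deep block around a single middle block.

```python
def token_count(image_size: int, patch_size: int, num_extra_tokens: int = 2) -> int:
    """Length of the token sequence for a square input.

    Assumption: the sequence is the patch tokens plus `num_extra_tokens`
    (here one time token and one condition token).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return num_extra_tokens + patches_per_side * patches_per_side


def skip_pairs(depth: int) -> list[tuple[int, int]]:
    """Return (shallow_block, deep_block) index pairs for long skip connections.

    Assumption: `depth` is odd, blocks are indexed 0..depth-1, the block at
    index depth // 2 is the middle block, and skips are wired symmetrically
    around it (innermost pair first).
    """
    if depth < 3 or depth % 2 == 0:
        raise ValueError("depth must be an odd integer >= 3")
    half = depth // 2
    return [(half - 1 - i, half + 1 + i) for i in range(half)]


if __name__ == "__main__":
    # A 32x32 input with 2x2 patches gives 16 * 16 = 256 patch tokens,
    # plus one time token and one condition token.
    print(token_count(32, 2))
    # With 5 blocks, block 2 is the middle; blocks 1->3 and 0->4 are paired.
    print(skip_pairs(5))
```

Under these assumptions, the deepest block receives features from the shallowest one, mirroring the encoder-decoder pairing of a U-Net while every block still operates on the same flat token sequence.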