baofff/U-ViT

Diffusion models work fine without CNN up- and downsampling

It tests whether diffusion models need a CNN U-Net, or if a Vision Transformer with long skip connections is enough.

★1.1k stars Jupyter Notebook Image · Video · Audio

What it does

U-ViT is a transformer backbone built to replace the convolutional U-Net in diffusion models. It splits the noisy image into patches and treats the patches, the diffusion timestep, and any conditioning (such as a class label or text) as a single sequence of tokens, then processes that sequence with a Vision Transformer whose long skip connections link shallow layers to deep ones. The repository provides pretrained weights, training scripts for pixel-space and latent diffusion, and evaluation code for CIFAR-10, ImageNet, and MS-COCO.
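
As a rough illustration of the token stream described above, the following NumPy sketch builds one sequence from a timestep, a class label, and image patches. It is not the repository's code: the function names (patchify, timestep_embedding, build_tokens), the random matrices standing in for learned embeddings, and the sizes (a 32×32 image, patch size 2, width 64) are all assumptions chosen for illustration. Position embeddings are omitted to keep it short.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def timestep_embedding(t: float, dim: int) -> np.ndarray:
    """Sinusoidal embedding of a scalar timestep (dim is assumed even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.cos(t * freqs), np.sin(t * freqs)])


def build_tokens(image, t, label, patch=2, dim=64, num_classes=10, seed=0):
    """Return one (2 + num_patches, dim) sequence: time, label, then patches."""
    rng = np.random.default_rng(seed)
    patches = patchify(image, patch)
    # Random matrices stand in for embeddings that a real model would learn.
    patch_proj = rng.normal(scale=0.02, size=(patches.shape[1], dim))
    label_table = rng.normal(scale=0.02, size=(num_classes, dim))
    time_token = timestep_embedding(t, dim)[None, :]
    label_token = label_table[label][None, :]
    patch_tokens = patches @ patch_proj
    # Every input becomes a row of the same sequence; no separate pathways.
    return np.concatenate([time_token, label_token, patch_tokens], axis=0)


noisy_image = np.random.default_rng(1).normal(size=(32, 32, 3))
tokens = build_tokens(noisy_image, t=250.0, label=3)
print(tokens.shape)  # (258, 64): 1 time token + 1 label token + 256 patch tokens
```

The point of the sketch is the last line of build_tokens: once everything is a row in one array, a standard transformer can attend across time, condition, and image content without any special-case plumbing.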

The interesting bit

The authors argue that long skip connections are the critical ingredient, implying that the downsampling and upsampling operators in standard U-Nets are not strictly necessary for diffusion-based image modeling. Their best latent diffusion model reportedly achieves an FID of 2.29 on class-conditional ImageNet 256×256 and 5.48 on MS-COCO text-to-image without accessing large external datasets during generative training.
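
The role of the long skip connections can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: the names run_with_long_skips and concat_then_project are hypothetical, each "block" is an arbitrary function on a single token vector, and the fixed weight matrix stands in for a projection that a real model would learn. It shows shallow outputs being saved, paired last-in-first-out with deep blocks, and fused by concatenation followed by a linear projection, which is one way to merge a skip branch. Nothing in it downsamples or upsamples.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def concat_then_project(main: Vector, skip: Vector,
                        weight: Sequence[Sequence[float]]) -> Vector:
    """Concatenate two length-d vectors, then map 2d -> d with a matrix."""
    joined = main + skip  # list concatenation, length 2d
    return [sum(w * v for w, v in zip(row, joined)) for row in weight]


def run_with_long_skips(x: Vector, in_blocks: Sequence[Block], mid_block: Block,
                        out_blocks: Sequence[Block],
                        weight: Sequence[Sequence[float]]) -> Vector:
    """Run shallow blocks, a middle block, then deep blocks fed by long skips."""
    assert len(in_blocks) == len(out_blocks), "each shallow block needs a deep partner"
    saved: List[Vector] = []
    for block in in_blocks:
        x = block(x)
        saved.append(x)  # keep every shallow output for later
    x = mid_block(x)
    for block in out_blocks:
        # The most recently saved output is consumed first (LIFO pairing),
        # so the first shallow block connects to the last deep block.
        x = concat_then_project(x, saved.pop(), weight)
        x = block(x)
    return x


def add(c: float) -> Block:
    """Toy block: add a constant to every element (hypothetical stand-in)."""
    return lambda v: [e + c for e in v]


weight = [[0.5, 0.5]]  # d = 1: averages the main and skip values
result = run_with_long_skips([0.0], [add(1.0), add(10.0)], add(100.0),
                             [add(0.0), add(0.0)], weight)
print(result)  # [31.0]
```

Tracing the toy by hand: the shallow blocks produce 1 and 11, the middle block gives 111, the first deep block fuses 111 with 11 to get 61, and the second fuses 61 with 1 to get 31. The sequence length and width never change along the way, which is the contrast with a U-Net's resolution changes.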

Key highlights

Treats time embeddings, class labels, and noisy image patches as interchangeable tokens rather than separate modalities
Ships with pretrained models from CIFAR-10 up to ImageNet 512×512, plus a Colab demo for the 2.29 FID ImageNet checkpoint
Training scripts cover pixel-space diffusion, latent diffusion with continuous or discrete timesteps, and text-to-image generation
Includes memory optimizations like gradient checkpointing and optional xformers attention, enabling training of the largest U-ViT-H/2 variant on high-resolution ImageNet with a batch size of 1024 on two A100 GPUs
Built on timm==0.3.2, which the authors note requires a manual patch for PyTorch 1.8.1+

Caveats

Depends on a specific, older timm version that needs a community fix to function with modern PyTorch releases
Pretrained latent diffusion models require downloading Stable Diffusion autoencoders and extracted feature files separately; the setup is not self-contained
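
Because the pinned timm release matters here, a quick environment check before running anything can save time. The helper below is a hypothetical convenience (check_pin is not part of the repository); it relies only on the standard-library importlib.metadata module and compares the installed version of a package with the expected pin. It does not apply or detect the PyTorch compatibility patch itself.

```python
from importlib import metadata


def check_pin(package: str, expected: str) -> str:
    """Describe whether the installed version of `package` matches `expected`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed (expected {expected})"
    if installed == expected:
        return f"{package}=={installed} matches the pin"
    return f"{package}=={installed} differs from the pinned {expected}"


# The pin below comes from the highlights above; the output depends on
# the local environment, so none is shown here.
print(check_pin("timm", "0.3.2"))
```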

Verdict

Worth a look if you are building or fine-tuning diffusion backbones and want a drop-in transformer alternative to the U-Net. Skip it if you are just looking for a turnkey Stable Diffusion wrapper—this is research infrastructure, not a consumer app.

Frequently asked

What is baofff/U-ViT? A research codebase that tests whether diffusion models need a CNN U-Net, or if a Vision Transformer with long skip connections is enough.
Is U-ViT open source? Yes, baofff/U-ViT is open source, released under the MIT license.
What language is U-ViT written in? GitHub lists Jupyter Notebook as the primary language of baofff/U-ViT, most likely because of the bundled demo notebook; the model and training code is Python, built on PyTorch.
How popular is U-ViT? baofff/U-ViT has 1.1k stars on GitHub.
Where can I find U-ViT? baofff/U-ViT is on GitHub at https://github.com/baofff/U-ViT.