lucidrains/vit-pytorch

A zoo of vision transformers, all in one pip install

lucidrains collected two dozen ViT variants so you don't have to reimplement them from scratch.

25.3k stars · Python · ML Frameworks · Computer Vision
Velocity (7d): +12 ★ / day · Trend: steady

What it does

This is a PyTorch implementation of the original Vision Transformer paper, plus a sprawling menagerie of follow-up architectures. You get the base ViT, but also SimpleViT, NaViT, CaiT, DeepViT, T2T, CCT, LeViT, MobileViT, masked autoencoders, distillation wrappers, and roughly fifteen more. Each is importable and configurable with standard constructor arguments.

The interesting bit

The author admits “there’s really not much to code here” for the base model, but the value is curation: every variant comes with a paper citation, a minimal code example, and enough comments to map the research idea to the API. It’s a living literature review you can pip install.

Key highlights

  • Original ViT plus 20+ variants (NaViT for variable resolutions, CCT for compact conv-based tokens, XCiT, MaxViT, etc.)
  • Self-contained modules: from vit_pytorch import ViT or from vit_pytorch.cait import CaiT
  • Distillation support with DistillableViT and a .to_vit() method to convert back after training
  • NaViT supports nested tensors (PyTorch 2.5+) to skip masking/padding overhead
  • Attention maps are accessible for visualization

Caveats

  • No pretrained weights included; for those, the README points to Ross Wightman’s timm
  • Some variants (e.g., NaViT) require manual batch grouping or nested tensors, which adds dataloading complexity
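To give a feel for the batch grouping mentioned in the NaViT caveat, here is a dependency-free sketch that greedily packs variable-resolution images into groups under a token budget. The helper names num_patches and group_by_token_budget are hypothetical and this is not the library's own grouping logic; it only illustrates the kind of bookkeeping a dataloader would need.

```python
from typing import List, Tuple


def num_patches(height: int, width: int, patch_size: int) -> int:
    """Patch tokens produced by an image whose sides are multiples of patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def group_by_token_budget(
    sizes: List[Tuple[int, int]], patch_size: int, max_seq_len: int
) -> List[List[int]]:
    """Greedily pack image indices, in order, into groups of at most max_seq_len tokens."""
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, (height, width) in enumerate(sizes):
        tokens = num_patches(height, width, patch_size)
        if tokens > max_seq_len:
            raise ValueError(f"image {idx} alone exceeds the token budget")
        # Start a new group when this image would overflow the current one
        if current and used + tokens > max_seq_len:
            groups.append(current)
            current, used = [], 0
        current.append(idx)
        used += tokens
    if current:
        groups.append(current)
    return groups


# (height, width) of five images with mixed resolutions
sizes = [(256, 256), (128, 128), (128, 256), (256, 128), (64, 256)]
groups = group_by_token_budget(sizes, patch_size=32, max_seq_len=80)
print(groups)  # [[0, 1], [2, 3, 4]]
```

Each inner list would then be packed into one sequence, which is the extra dataloading step the caveat refers to.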

Verdict

Grab this if you’re prototyping vision transformer research or need a clean, hackable baseline. Skip it if you just want pretrained weights for production; use timm instead.
