A zoo of vision transformers, all in one pip install
lucidrains collected two dozen ViT variants so you don't have to reimplement them from scratch.

What it does
This is a PyTorch implementation of the original Vision Transformer paper, plus a sprawling menagerie of follow-up architectures. You get the base ViT, but also SimpleViT, NaViT, CaiT, DeepViT, T2T, CCT, LeViT, MobileViT, masked autoencoders, distillation wrappers, and roughly fifteen more. Each is importable and configurable with standard constructor arguments.
The interesting bit
The author admits “there’s really not much to code here” for the base model, but the value is curation: every variant comes with a paper citation, a minimal code example, and enough comments to map the research idea to the API. It’s a living literature review you can pip install.
Key highlights
- Original ViT plus 20+ variants (NaViT for variable resolutions, CCT for compact conv-based tokens, XCiT, MaxViT, etc.)
- Self-contained modules: `from vit_pytorch import ViT` or `from vit_pytorch.cait import CaiT`
- Distillation support with `DistillableViT` and a `.to_vit()` method to convert back after training
- NaViT supports nested tensors (PyTorch 2.5+) to skip masking/padding overhead
- Attention maps are accessible for visualization
Caveats
- No pretrained weights included; for those, the README points to Ross Wightman’s `timm`
- Some variants (e.g., NaViT) require manual batch grouping or nested tensors, which adds dataloading complexity
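To make the batch-grouping caveat concrete, here is a library-independent sketch of the bookkeeping involved: packing variable-resolution images into groups whose total patch-token count stays under a budget. `num_tokens` and `group_by_token_budget` are hypothetical helpers written for this illustration, not functions from vit-pytorch, and the greedy in-order strategy is just one simple choice.

```python
from typing import List, Tuple


def num_tokens(height: int, width: int, patch_size: int) -> int:
    """Patch tokens produced by one image; sides must divide evenly."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def group_by_token_budget(
    sizes: List[Tuple[int, int]], patch_size: int, max_seq_len: int
) -> List[List[int]]:
    """Greedily pack image indices, in order, so each group fits the budget."""
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, (h, w) in enumerate(sizes):
        n = num_tokens(h, w, patch_size)
        if n > max_seq_len:
            raise ValueError(f"image {idx} alone exceeds max_seq_len")
        if used + n > max_seq_len:
            # Current group is full: close it and start a new one.
            groups.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        groups.append(current)
    return groups


# (height, width) of five images with different resolutions.
sizes = [(256, 256), (128, 128), (128, 256), (256, 128), (64, 256)]
groups = group_by_token_budget(sizes, patch_size=32, max_seq_len=64)
print(groups)  # [[0], [1, 2], [3, 4]]
```

Each inner list would become one packed sequence; a real dataloader would also need to carry the images and labels along with these indices.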
Verdict
Grab this if you’re prototyping vision transformer research or need a clean, hackable baseline. Skip it if you just want pretrained weights for production—use timm instead.