jeonsworld/ViT-pytorch
A PyTorch reimplementation of the Vision Transformer model for image classification tasks.

This repository provides a PyTorch reimplementation of the Vision Transformer (ViT) architecture from the paper ‘An Image is Worth 16x16 Words’. The model splits each image into fixed-size patches and feeds the resulting patch sequence directly to a transformer encoder for image recognition at scale. The repository supports loading Google’s official pretrained checkpoints and training on datasets such as CIFAR-10 and ImageNet. It implements both pure ViT and hybrid ResNet+ViT variants across multiple model sizes, from ViT-B/16 to ViT-H/14.
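To make the patch idea concrete, here is a minimal pure-Python sketch of how an image becomes a token sequence. It is not taken from the repository: the function names `patch_sequence_length` and `split_into_patches` are hypothetical, the image is assumed to be single-channel and square-divisible by the patch size, and the learned linear projection and position embeddings described in the paper are left out.

```python
def patch_sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens the encoder sees for a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional classification token.
    return patches_per_side * patches_per_side + (1 if class_token else 0)


def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, then top to bottom).
    Assumption: a single-channel image; a real model would also handle channels.
    """
    height, width = len(image), len(image[0])
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # A 224x224 input with 16x16 patches gives 14 * 14 patches plus one class token.
    print(patch_sequence_length(224, 16))
    # A toy 4x4 "image" split into four 2x2 patches.
    toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
    print(split_into_patches(toy_image, 2))
```

In the model itself, each flattened patch would then be mapped to an embedding vector before entering the encoder. The sketch only shows how the patch size determines the sequence length, which is why a smaller patch size (as in the /14 variant) yields a longer sequence for the same input resolution.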