IBM/CrossViT
CrossViT is a vision transformer model that uses cross-attention across multiple scales for image classification.

This repository provides the official PyTorch implementation of CrossViT, a vision transformer architecture that combines multi-scale features through cross-attention for image classification on ImageNet. It includes training scripts and pretrained model weights, and it supports distributed multi-GPU training.
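To make the cross-attention idea concrete, here is a minimal sketch in plain Python. It is not the repository's code. It assumes the fusion works by letting the class (CLS) token of one scale branch act as the query over the patch tokens of the other branch. The names `softmax` and `cls_cross_attention` are hypothetical, and the learned query/key/value projections and multiple heads of a real transformer are replaced with identity projections and a single head to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_cross_attention(cls_token, other_patches):
    """Fuse one branch's CLS token with another branch's patch tokens.

    Assumption for illustration: single head, identity Q/K/V projections.
    cls_token:      list of d floats (query, from branch A)
    other_patches:  list of d-float lists (patch tokens, from branch B)
    Returns (fused_cls, attention_weights).
    """
    d = len(cls_token)
    # Keys and values are the CLS token itself plus the other branch's patches.
    tokens = [cls_token] + other_patches
    # Scaled dot-product score between the query and every key.
    scores = [
        sum(q * k for q, k in zip(cls_token, t)) / math.sqrt(d)
        for t in tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the values.
    attended = [
        sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)
    ]
    # Residual connection: add the attended summary back onto the CLS token.
    fused = [c + a for c, a in zip(cls_token, attended)]
    return fused, weights


if __name__ == "__main__":
    small_branch_cls = [0.5, -0.2, 0.1, 0.3]
    large_branch_patches = [
        [0.1, 0.4, -0.3, 0.2],
        [0.7, -0.1, 0.0, 0.5],
        [-0.2, 0.3, 0.6, -0.4],
    ]
    fused, weights = cls_cross_attention(small_branch_cls, large_branch_patches)
    print("attention weights:", weights)
    print("fused CLS token:", fused)
```

Because only the CLS token is used as the query, the cost of this step grows linearly with the number of patch tokens rather than quadratically, which is the usual argument for this style of fusion.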