apple/ml-fastvit
FastViT is a hybrid vision transformer architecture for image classification that uses structural reparameterization to achieve efficient inference on mobile devices.

This repository provides the official PyTorch implementation of FastViT. The architecture combines convolutional and transformer layers, and uses structural reparameterization to fold the multi-branch blocks used during training into mathematically equivalent single-branch blocks for inference. All models are trained on ImageNet-1K and benchmarked on mobile devices, including the iPhone 12 Pro. The repository includes training code, pre-trained checkpoints, and Core ML models for deployment.
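
The idea behind structural reparameterization can be illustrated with a minimal, dependency-free sketch. This is not code from this repository: the function names (`conv1d`, `multi_branch`, `reparameterize`) and the toy 1D setup are assumptions made for illustration, and the batch-norm folding that real reparameterizable blocks typically also perform is omitted. The sketch assumes a training-time block that sums a 3-tap convolution, a 1-tap convolution, and an identity skip connection. Because all three branches are linear, they can be merged into a single 3-tap kernel.

```python
def conv1d(signal, kernel):
    """Same-padded 1D cross-correlation with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_branch(signal, k3, k1):
    """Training-time form: 3-tap conv + 1-tap conv + identity skip, summed."""
    a = conv1d(signal, k3)
    b = conv1d(signal, [k1])
    return [x + y + z for x, y, z in zip(a, b, signal)]


def reparameterize(k3, k1):
    """Fold the 1-tap conv and the identity into the centre tap of k3."""
    merged = list(k3)
    merged[1] += k1 + 1.0  # the identity branch acts like a 1-tap kernel of 1.0
    return merged


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
signal = [1.0, 2.0, 3.0, 4.0]
k3 = [0.5, -1.0, 0.25]
k1 = 2.0

train_out = multi_branch(signal, k3, k1)            # three branches
infer_out = conv1d(signal, reparameterize(k3, k1))  # one merged branch
print(train_out)
print(infer_out)
```

Both calls produce the same output, but the merged form needs a single convolution instead of three separate branches, which is where the inference-time savings come from.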