microsoft/CvT
A deep learning model that combines convolutional layers with the Vision Transformer architecture for high-performance image classification.

Convolutional Vision Transformers (CvT) combine convolutional neural networks with the Transformer architecture in two ways: a hierarchy of Transformer stages, each preceded by a convolutional token embedding, and convolutional projections inside each Transformer block. The model is reported to achieve state-of-the-art ImageNet classification accuracy with fewer parameters than comparable Vision Transformers and ResNets. Positional encodings can optionally be omitted, which simplifies applying the model to higher-resolution inputs.
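
To make the hierarchy concrete, the sketch below computes how a strided convolutional token embedding shrinks the token grid from one stage to the next. It is a rough illustration and does not use the repository's code. The `StageEmbedding` class, the `token_grid_per_stage` helper, and the kernel, stride, padding, and width values in `ASSUMED_STAGES` are all assumptions chosen to resemble a small three-stage configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageEmbedding:
    """Hypothetical description of one stage's convolutional token embedding."""
    kernel: int
    stride: int
    padding: int
    dim: int  # channel width of the tokens produced by this stage


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid_per_stage(image_size, stages):
    """Return (grid_side, num_tokens, dim) for each stage of the hierarchy."""
    result = []
    side = image_size
    for stage in stages:
        # Each stage re-embeds the previous stage's 2D token map with a conv,
        # so the grid shrinks while the channel width grows.
        side = conv_out(side, stage.kernel, stage.stride, stage.padding)
        result.append((side, side * side, stage.dim))
    return result


# Assumed three-stage configuration (illustrative values, not read from the repo).
ASSUMED_STAGES = [
    StageEmbedding(kernel=7, stride=4, padding=2, dim=64),
    StageEmbedding(kernel=3, stride=2, padding=1, dim=192),
    StageEmbedding(kernel=3, stride=2, padding=1, dim=384),
]

if __name__ == "__main__":
    for i, (side, tokens, dim) in enumerate(token_grid_per_stage(224, ASSUMED_STAGES), 1):
        print(f"stage {i}: {side}x{side} grid, {tokens} tokens, dim {dim}")
```

Under these assumed settings, the token count drops at every stage while the channel width grows, which is the pyramid structure familiar from CNNs. The same function can be called with a different `image_size`: the grid simply scales with the input, which is one reason a design without a fixed-length positional encoding table is convenient at higher resolutions.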