microsoft/CvT

CvT gives vision transformers a convolutional transplant

CvT exists to fix vision transformers by stealing CNN tricks—convolutional token embeddings and projections—so the hybrid models run leaner and score higher on ImageNet.

★ 609 stars · Python · Computer Vision

What it does

CvT is a vision backbone that injects convolutions into the Vision Transformer architecture. It swaps the standard patch embedding for a convolutional token embedding and the linear projection inside each Transformer block for a convolutional projection, aiming to keep global attention while recovering the shift, scale, and distortion invariance that CNNs enjoy. Microsoft provides the official PyTorch implementation and pretrained checkpoints for several scales.
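To make the difference between a plain patch embedding and a convolutional token embedding concrete, here is a minimal sketch that lists the pixel span each token sees along one image axis. The kernel, stride, and padding values (7, 4, 2) are assumptions recalled from the first stage in the CvT paper, not figures from this page, and `token_spans` and `overlap` are hypothetical helpers written only for illustration.

```python
def token_spans(size, kernel, stride, padding=0):
    """Return (start, end) pixel spans, end exclusive, seen by each token
    along one axis. Starts below 0 or ends above `size` fall in the padding."""
    count = (size + 2 * padding - kernel) // stride + 1
    return [(i * stride - padding, i * stride - padding + kernel)
            for i in range(count)]


def overlap(spans):
    """Number of pixels shared by the first two neighbouring tokens."""
    return max(0, spans[0][1] - spans[1][0])


# ViT-style patch embedding: 16-pixel patches with stride 16, so no overlap.
vit = token_spans(224, kernel=16, stride=16)

# Convolutional token embedding (assumed stage-1 values): overlapping windows.
cvt = token_spans(224, kernel=7, stride=4, padding=2)

print(len(vit), "tokens per axis, overlap", overlap(vit))
print(len(cvt), "tokens per axis, overlap", overlap(cvt))
```

Under these assumed values, neighbouring windows share pixels, which is one way to picture how local spatial context ends up inside every token.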

The interesting bit

Because convolutions already bake spatial structure into the tokens, CvT can drop positional encoding entirely, a component that existing vision transformers normally treat as crucial. The design also arranges the Transformers in a multi-stage hierarchy, shrinking the token grid from stage to stage instead of keeping one fixed-length sequence throughout, which helps efficiency.
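The hierarchy can be sketched with plain shape arithmetic. The per-stage kernel, stride, padding, and width values below are assumptions recalled from the CvT-13 configuration in the paper, not figures stated on this page, and `stage_shapes` is a hypothetical helper.

```python
# Assumed (kernel, stride, padding, embed_dim) per stage for CvT-13.
STAGES = [(7, 4, 2, 64), (3, 2, 1, 192), (3, 2, 1, 384)]


def stage_shapes(image_size, stages=STAGES):
    """Return [(grid_side, num_tokens, embed_dim), ...] for a square input."""
    shapes = []
    side = image_size
    for kernel, stride, padding, dim in stages:
        # Standard convolution output-size formula, applied stage by stage.
        side = (side + 2 * padding - kernel) // stride + 1
        shapes.append((side, side * side, dim))
    return shapes


for side, tokens, dim in stage_shapes(224):
    print(f"{side}x{side} grid -> {tokens} tokens of width {dim}")
```

Calling `stage_shapes(384)` only changes the grid sides (96, 48, 24 under these assumptions). Nothing in the sketch is tied to a fixed token count, which is roughly why a model without a positional table has nothing to resize when the resolution changes.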

Key highlights

CvT-13 hits 81.6% top-1 accuracy on ImageNet-1k with 20M parameters and 4.5 GFLOPs.
The CvT-W24 variant reaches 87.7% top-1 on ImageNet-1k after pretraining on ImageNet-22k.
Positional encoding is unnecessary, which the authors argue simplifies higher-resolution tasks.
The architecture introduces two specific changes: a convolutional token embedding and a convolutional projection inside the Transformer blocks.
Pretrained weights are available through a model zoo.
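One way to see why a convolutional projection can make the blocks leaner: if the key and value projections use a stride larger than 1, the attention score matrix gets fewer columns. The sketch below assumes a 3x3 kernel with padding 1, stride 1 for queries, and stride 2 for keys and values. These values are recalled from the paper rather than taken from this page, and `conv_out` and `attention_shape` are hypothetical helpers.

```python
def conv_out(side, kernel=3, stride=1, padding=1):
    """Output side length of a square convolution."""
    return (side + 2 * padding - kernel) // stride + 1


def attention_shape(grid_side, kv_stride=2):
    """Return (query_tokens, key_value_tokens) for one attention layer.

    Queries keep the full grid (stride 1); keys and values are
    subsampled by `kv_stride` (an assumed setting).
    """
    q_side = conv_out(grid_side, stride=1)
    kv_side = conv_out(grid_side, stride=kv_stride)
    return q_side * q_side, kv_side * kv_side


for grid in (56, 28, 14):
    q, kv = attention_shape(grid)
    full = q * q      # score-matrix entries with stride-1 keys
    reduced = q * kv  # score-matrix entries with strided keys
    print(f"grid {grid}: {q} queries x {kv} keys, {full // reduced}x fewer scores")
```

The sketch ignores any classification token and counts only score-matrix entries, so treat the ratio as an illustration rather than a FLOP estimate.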

Caveats

The code was developed and tested only on PyTorch 1.7.1; other versions are not fully tested.
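A quick way to check an environment against that tested release is sketched below. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper names are hypothetical, and treating a local build tag such as `+cu110` as the same release is an assumption of this sketch.

```python
from importlib import metadata

TESTED_TORCH = "1.7.1"  # the release named in the caveat above


def release(version):
    """Strip a local build tag such as '+cu110' from a version string."""
    return version.split("+", 1)[0]


def matches_tested(version, tested=TESTED_TORCH):
    """True if `version` is the tested release, ignoring build tags."""
    return release(version) == tested


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


found = installed_torch_version()
if found is None:
    print("torch is not installed")
elif matches_tested(found):
    print(f"torch {found} matches the tested release")
else:
    print(f"torch {found} differs from the tested release {TESTED_TORCH}")
```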

Verdict

Computer vision practitioners hunting for a leaner, more accurate image classification backbone than vanilla ViT or ResNet should care. If your pipeline does not involve ImageNet-scale vision models, this is strictly academic scenery.

Frequently asked

What is microsoft/CvT?: CvT exists to fix vision transformers by stealing CNN tricks—convolutional token embeddings and projections—so the hybrid models run leaner and score higher on ImageNet.
Is CvT open source?: Yes — microsoft/CvT is open source, released under the MIT license.
What language is CvT written in?: microsoft/CvT is primarily written in Python.
How popular is CvT?: microsoft/CvT has 609 stars on GitHub.
Where can I find CvT?: microsoft/CvT is on GitHub at https://github.com/microsoft/CvT.