xxxnell/how-do-vits-work
PyTorch implementation of the ICLR 2022 paper "How Do Vision Transformers Work?"

This repository provides the official implementation of a peer-reviewed research paper on the mechanics of Vision Transformers. The paper investigates why Multi-head Self-Attention (MSA) modules benefit neural networks, asking whether they act mainly as spatial smoothing of feature maps or as a way to capture long-range dependencies. It introduces AlterNet, a hybrid architecture that combines CNNs and MSAs by placing MSA blocks at the ends of stages, and the repository provides tools for analyzing loss landscapes and the frequency response of attention mechanisms.
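The spatial-smoothing view can be illustrated with a toy example. The sketch below is not taken from the repository. It is a minimal, standard-library-only illustration that assumes a single attention head over a 1D ring of scalar tokens, with scores that depend only on token distance (a hypothetical stand-in for learned query-key scores). The names `toy_attention` and `dft_amplitude` are made up for this example. Because softmax weights are positive and sum to one, each output token is a weighted average of the inputs, and comparing Fourier amplitudes before and after shows that the fast component is damped far more than the slow one.

```python
import cmath
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_attention(values, temperature=1.0):
    """Single-head attention over a 1D ring of scalar tokens.

    Assumption: the score between tokens i and j is
    -(circular distance)^2 / temperature, standing in for learned
    query-key scores. Returns (outputs, attention_rows).
    """
    n = len(values)
    rows = []
    for i in range(n):
        scores = []
        for j in range(n):
            d = min(abs(i - j), n - abs(i - j))  # circular token distance
            scores.append(-(d * d) / temperature)
        rows.append(softmax(scores))
    # Each output is a weighted average of the value tokens.
    outputs = [sum(w * v for w, v in zip(row, values)) for row in rows]
    return outputs, rows


def dft_amplitude(signal, k):
    """Amplitude of the k-th discrete Fourier component of a real signal."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
    return abs(coeff) / n


n = 16
# Input: one slow wave (frequency 1) plus one fast wave (frequency 7).
x = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n)
     for t in range(n)]
y, attn = toy_attention(x)

# Fraction of each component's amplitude that survives the attention layer.
low_ratio = dft_amplitude(y, 1) / dft_amplitude(x, 1)
high_ratio = dft_amplitude(y, 7) / dft_amplitude(x, 7)
print(f"low-frequency amplitude kept:  {low_ratio:.3f}")
print(f"high-frequency amplitude kept: {high_ratio:.3f}")
```

In this toy setting the attention matrix is a circular convolution with a positive kernel, so the damping of high frequencies is guaranteed by construction. Learned, content-dependent attention need not behave this cleanly, which is why an empirical frequency analysis on real feature maps is useful.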