Attention is convolution in a trenchcoat
This ICLR 2020 paper proves self-attention can express any convolutional layer—and shows that trained models often learn to do exactly that.

What it does
This repository holds the code for a paper that asks a blunt question: when vision transformers use self-attention, are they secretly just doing convolutions? The authors prove mathematically that a multi-head self-attention layer with enough heads can represent any convolutional layer. Then they run experiments to check whether trained attention layers actually converge toward convolution-like behavior. Spoiler: they often do.
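To make the claim concrete, here is a minimal NumPy sketch of the idea, not the repository's code. It assumes one head per kernel position (so a K x K kernel uses K*K heads) and that each head puts all of its attention weight on the pixel at one fixed relative offset. For simplicity, heads whose target falls outside the image get zero weight, which mimics zero padding; a real softmax head would always sum to one. The function names `conv2d_same` and `attention_as_conv` are made up for illustration.

```python
import numpy as np

def conv2d_same(x, w):
    """Reference K x K convolution (cross-correlation) with zero padding.
    x: (H, W, C_in), w: (K, K, C_in, C_out) -> (H, W, C_out)."""
    H, W, _ = x.shape
    K = w.shape[0]
    r = K // 2
    xp = np.pad(x, ((r, r), (r, r), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + K, j:j + K, :]             # (K, K, C_in)
            out[i, j] = np.tensordot(patch, w, axes=3)  # sum over K, K, C_in
    return out

def attention_as_conv(x, w):
    """Same output, computed as K*K hard-attention heads over flattened pixels.
    Assumption: head (a, b) attends only to the pixel at offset (a - r, b - r)."""
    H, W, C_in = x.shape
    K = w.shape[0]
    r = K // 2
    n = H * W
    tokens = x.reshape(n, C_in)                # each pixel is one token
    out = np.zeros((n, w.shape[3]))
    for a in range(K):
        for b in range(K):
            dy, dx = a - r, b - r
            A = np.zeros((n, n))               # attention matrix of this head
            for i in range(H):
                for j in range(W):
                    ki, kj = i + dy, j + dx
                    if 0 <= ki < H and 0 <= kj < W:
                        A[i * W + j, ki * W + kj] = 1.0
            # w[a, b] plays the role of this head's value/output projection
            out += A @ tokens @ w[a, b]
    return out.reshape(H, W, -1)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5, 3))
w = rng.normal(size=(3, 3, 3, 4))
print(np.allclose(conv2d_same(x, w), attention_as_conv(x, w)))
```

Summing the heads reproduces the convolution because each head contributes exactly one kernel position's worth of the weighted sum.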
The interesting bit
The proof is constructive, not just an existence argument. The authors also built an interactive website where you can poke at attention patterns directly, which is rarer than it should be for a math-heavy paper.
Key highlights
- Formal proof that multi-head self-attention subsumes convolutional layers (given sufficient heads)
- Empirical evidence that trained attention heads pick up convolution-like patterns in practice
- Reproducible experiments via shell scripts in runs/
- Interactive visualization at epfml.github.io/attention-cnn
- Published at ICLR 2020; 1,121 stars on the repository
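As a rough illustration of what a "convolution-like" attention pattern can look like, the sketch below shows one way a soft attention head can end up focused on a single neighbouring pixel. It assumes scores that fall off with the squared distance between each relative offset and a target offset, scaled by a sharpness parameter `alpha`. That scoring rule, and the names `soft_head_weights`, `target`, and `alpha`, are assumptions chosen for illustration, not code taken from the repository.

```python
import numpy as np

def soft_head_weights(size, target, alpha):
    """Softmax attention of one query pixel over a size x size neighbourhood.
    Assumed scoring rule: -alpha * squared distance from the offset `target`."""
    r = size // 2
    offsets = np.array([(dy, dx)
                        for dy in range(-r, r + 1)
                        for dx in range(-r, r + 1)])
    sq_dist = np.sum((offsets - np.array(target)) ** 2, axis=1)
    scores = -float(alpha) * sq_dist
    scores -= scores.max()          # stabilise the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum()
    return offsets, weights

for alpha in (0.0, 1.0, 50.0):
    offsets, weights = soft_head_weights(5, target=(1, -1), alpha=alpha)
    peak = offsets[np.argmax(weights)]
    print(alpha, tuple(int(v) for v in peak), round(float(weights.max()), 3))
```

With `alpha` at zero the head averages uniformly over the neighbourhood; as `alpha` grows the weights collapse onto the target offset, which is the hard, shifted pattern a single convolution kernel position corresponds to.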
Caveats
- Setup instructions specify CUDA 10.0 and Anaconda, so modern environments may need massaging
- The repo is research code: expect paper reproduction scripts, not a maintained library
Verdict
Worth a look if you’re trying to understand why vision transformers work—or if you need ammunition for arguments about inductive biases. Skip if you want production-ready attention primitives.