Vision transformers, but make it signal processing
GFNet replaces self-attention with a learned FFT filter, cutting complexity from quadratic to log-linear while keeping global receptive fields.

What it does
GFNet is an image classification architecture that swaps the self-attention layer in vision transformers for frequency-domain operations. It runs a 2D FFT on the spatial features, multiplies the result by learnable complex-valued “global filters,” then applies an inverse FFT to return to the spatial domain. The whole thing is ~20 lines of PyTorch and runs in O(n log n) instead of O(n²), where n is the number of tokens.
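The FFT, filter, inverse-FFT pipeline described above can be sketched roughly as follows. This is a minimal illustration rather than the repository's exact code: the parameter name complex_weight, the 0.02 init scale, and the assumption that tokens form a square grid are all choices made here for the example.

```python
import math

import torch
import torch.nn as nn


class GlobalFilter(nn.Module):
    """Sketch of a global filter layer: 2D FFT -> learnable filter -> inverse FFT."""

    def __init__(self, dim, h=14, w=8):
        super().__init__()
        # Learnable filter stored as (real, imag) pairs. w = h // 2 + 1 because
        # rfft2 keeps only the non-redundant half of the last transformed axis.
        # The init scale is an assumption made for this sketch.
        self.complex_weight = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x):
        # x: (batch, tokens, channels); tokens are assumed to form a square grid.
        B, N, C = x.shape
        a = b = math.isqrt(N)
        x = x.reshape(B, a, b, C).to(torch.float32)
        # Spatial -> frequency domain over the two grid axes: (B, a, b // 2 + 1, C).
        x = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        # Element-wise multiply by the learned filter, broadcast over the batch.
        x = x * torch.view_as_complex(self.complex_weight)
        # Frequency -> spatial domain, restoring the original grid size.
        x = torch.fft.irfft2(x, s=(a, b), dim=(1, 2), norm="ortho")
        return x.reshape(B, N, C)
```

Used on a 14×14 token grid, GlobalFilter(dim=32) maps a (2, 196, 32) tensor to a tensor of the same shape. Setting the filter to real 1 and imaginary 0 everywhere makes the layer an identity, which is a handy sanity check.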
The interesting bit
The trick is that a single element-wise multiplication in frequency space acts as a global circular convolution in pixel space: every output location sees every input location, but through the FFT’s bookkeeping, not an explicit pairwise attention matrix. The authors visualize the learned filters, and they look like structured frequency responses, not random noise.
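The equivalence claimed above is the convolution theorem, and it is easy to check numerically. The toy sketch below (the 6×6 size and random values are arbitrary, chosen only for illustration) computes the same output two ways: by multiplying in the frequency domain, and by an explicit circular convolution in which every output pixel sums over every input pixel.

```python
import torch

torch.manual_seed(0)
H, W = 6, 6
x = torch.randn(H, W, dtype=torch.float64)  # toy single-channel feature map
k = torch.randn(H, W, dtype=torch.float64)  # spatial kernel as large as the map

# Route 1: element-wise multiplication in the frequency domain.
y_fft = torch.fft.ifft2(torch.fft.fft2(x) * torch.fft.fft2(k)).real

# Route 2: explicit circular convolution with wrap-around indexing.
# Each output location (i, j) sums over all H * W input locations.
y_direct = torch.zeros(H, W, dtype=torch.float64)
for i in range(H):
    for j in range(W):
        for m in range(H):
            for n in range(W):
                y_direct[i, j] += x[m, n] * k[(i - m) % H, (j - n) % W]

# The two routes agree up to floating-point error.
max_err = (y_fft - y_direct).abs().max().item()
```

The explicit route costs O(n²) in the number of pixels n, while the FFT route costs O(n log n), which is where the complexity saving comes from.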
Key highlights
- Pretrained ImageNet models from 7M to 54M parameters, top-1 accuracy 74.6%–82.9%
- Core GlobalFilter layer is 8 lines of PyTorch using torch.fft.rfft2/irfft2
- Requires PyTorch ≥1.8.0 for the FFT API; code builds on timm and DeiT
- Supports fine-tuning at higher resolution (384×384 shown) and transfer learning scripts included
- MIT licensed
Caveats
- The learned filter assumes a fixed input resolution; the filter dimensions (h=14, w=8) are hardcoded to the feature map size at that layer
- No single-GPU training recipe; the provided scripts assume an 8-GPU distributed launch
- The repo’s “Jupyter Notebook” language label is misleading: it’s PyTorch code with some notebook visualizations
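The fixed-resolution caveat above comes down to shape arithmetic. The helper below is hypothetical (it is not part of the repo) and assumes a square input and a patch size of 16, which is consistent with a 14×14 feature map at 224×224 input. The second dimension is roughly halved because a real-input FFT such as rfft2 keeps only the non-redundant half of the spectrum.

```python
def filter_shape(image_size, patch_size=16):
    """Return (h, w) of the frequency-domain filter for a square input.

    Hypothetical helper for illustration; patch_size=16 is an assumption.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    h = image_size // patch_size  # tokens per side of the feature map
    w = h // 2 + 1                # rfft2 keeps only half of the last axis
    return h, w


shape_224 = filter_shape(224)  # (14, 8), matching the h=14, w=8 above
shape_384 = filter_shape(384)  # (24, 13)
```

A filter learned at one resolution therefore has a different shape from the one needed at another, so moving from 224×224 to 384×384 means the filter has to be resized or relearned.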
Verdict
Worth a look if you’re building vision models where quadratic attention is a bottleneck, especially at higher resolutions. Skip if you need flexible input sizes or want a drop-in replacement without thinking about frequency-domain shapes.