Teaching images and captions to pay attention to each other
A 2018 ECCV paper that makes image-text matching bidirectional by having each modality attend to the other, rather than fusing them into a single vector and hoping for the best.

What it does
SCAN (Stacked Cross Attention Network) learns to match images with text captions by computing fine-grained alignment between image regions and words, rather than collapsing both into a single shared embedding. It supports two directions: text-to-image attention (find the relevant image regions for each word) and image-to-text attention (find the relevant words for each image region). The code reproduces the ECCV 2018 paper from Microsoft Research, built as a fork of the VSE++ framework.
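To make the text-to-image direction concrete, here is a minimal pure-Python sketch. It is a simplification under stated assumptions, not the repository's code: the function names, the toy 2-D vectors, and the default `temperature` are all illustrative, and the extra similarity normalization described in the paper is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(temperature * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_to_image_attention(words, regions, temperature=9.0):
    """For each word, attend over image regions and score the word
    against its attended region vector. The temperature default is an
    illustrative assumption, not a documented setting."""
    scores = []
    for w in words:
        # Similarity of this word to every region.
        sims = [cosine(w, r) for r in regions]
        # Attention weights over regions, sharpened by the temperature.
        weights = softmax(sims, temperature)
        # Weighted sum of region vectors: the word's visual context.
        attended = [
            sum(wt * r[d] for wt, r in zip(weights, regions))
            for d in range(len(w))
        ]
        # Local relevance of the word to the image.
        scores.append(cosine(w, attended))
    return scores

# Toy example: two "word" vectors and three "region" vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_scores = text_to_image_attention(words, regions)
```

In this sketch, the image-to-text direction is the same computation with the roles swapped: calling `text_to_image_attention(regions, words)` yields one score per region instead of one per word.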
The interesting bit
The clever part is that SCAN attends in two stages rather than one. It first attends across modalities (for example, over image regions for each word), then scores each word against its attended context and pools those local similarity scores with an aggregation function (LogSumExp or simple averaging). The README includes exact command-line flags for reproducing each variant, which is the kind of detail that saves hours of head-scratching.
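The two pooling options are easy to compare side by side. The sketch below assumes a list of per-word (or per-region) similarity scores has already been computed. The function names and the default `lam` are illustrative assumptions, not the repository's identifiers.

```python
import math

def aggregate_avg(scores):
    """Average pooling: every local score counts equally."""
    return sum(scores) / len(scores)

def aggregate_lse(scores, lam=6.0):
    """LogSumExp pooling: (1/lam) * log(sum(exp(lam * s))).
    Larger lam pushes the result toward max(scores).
    The default lam is an illustrative assumption."""
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(lam * (s - m)) for s in scores)) / lam

# Hypothetical local similarity scores for a three-word caption.
local_scores = [0.2, 0.9, 0.5]
avg_score = aggregate_avg(local_scores)
lse_score = aggregate_lse(local_scores)
```

Averaging treats a caption with one strong match and two weak ones as mediocre overall, while LogSumExp lets the strongest local match dominate. That difference is why the two are worth treating as separate variants.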
Key highlights
- Pre-computed bottom-up attention features for Flickr30K and MS-COCO are available as a Kaggle dataset
- Four model variants with documented hyperparameters: t-i LSE, t-i AVG, i-t LSE, i-t AVG
- Built on PyTorch 0.3 (yes, that old) with Python 2.7
- Includes an evaluation script with 5-fold cross-validation support for MS-COCO
- Apache 2.0 licensed
Caveats
- Dependencies are frozen in 2018: PyTorch 0.3 and Python 2.7 will require environment archaeology to run today
Verdict
Worth a look if you’re researching cross-modal retrieval or need a baseline for image-text matching with explicit attention mechanisms. Skip it if you need something that runs out of the box on modern PyTorch—you’ll be porting code before you get results.