kuanghuei/SCAN

Teaching images and captions to pay attention to each other

A 2018 ECCV paper that makes image-text matching bidirectional by having each modality attend to the other, rather than fusing them into a single vector and hoping for the best.

Velocity (7d): +0.2 ★/day · Trend: steady

What it does

SCAN (Stacked Cross Attention Network) learns to match images with text captions by computing fine-grained alignment between image regions and words, rather than collapsing both into a single shared embedding. It supports two directions: text-to-image attention (find the relevant image regions for each word) and image-to-text attention (find the relevant words for each image region). The code reproduces the ECCV 2018 paper from Microsoft Research, built as a fork of the VSE++ framework.
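The text-to-image direction can be sketched in plain Python. This is a simplified illustration, not the repository's code: the function names and the `temperature` default are assumptions, and the extra similarity normalization steps from the paper are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_to_image_attention(words, regions, temperature=9.0):
    """Return one relevance score per word (hypothetical helper).

    Each word attends over all image regions; its score is the cosine
    similarity between the word and its attention-weighted region mix.
    The temperature value is illustrative, not a recommended setting.
    """
    dim = len(regions[0])
    scores = []
    for w in words:
        # Similarity of this word to every image region.
        sims = [cosine(w, r) for r in regions]
        # Sharpen and normalize into attention weights over regions.
        weights = softmax([temperature * s for s in sims])
        # Attended image vector: weighted sum of region features.
        attended = [
            sum(wt * r[d] for wt, r in zip(weights, regions))
            for d in range(dim)
        ]
        scores.append(cosine(w, attended))
    return scores

# Toy example: two "word" vectors, three "region" vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = text_to_image_attention(words, regions)
```

The image-to-text direction is the mirror image: swap the roles of `words` and `regions`, so each region attends over the words and the result is one score per region.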

The interesting bit

The clever part is that SCAN doesn’t stop at a single cross-attention pass: it stacks it, then pools the resulting per-word or per-region similarity scores into one image-sentence score with an aggregation function (LogSumExp or simple averaging). The README includes exact command-line flags for reproducing each variant, which is the kind of detail that saves hours of head-scratching.
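The difference between the two aggregation functions is easy to see on a handful of scores. A minimal sketch, assuming the per-word (or per-region) scores are already computed; `pool_avg`, `pool_lse`, and the `lam` default are illustrative names and values, not the repository's API.

```python
import math

def pool_avg(scores):
    """Average pooling: every word or region counts equally."""
    return sum(scores) / len(scores)

def pool_lse(scores, lam=6.0):
    """LogSumExp pooling: log(sum(exp(lam * s))) / lam.

    Computed in a numerically stable way by factoring out the maximum.
    Larger lam pushes the result toward max(scores); lam here is an
    illustrative value.
    """
    m = max(scores)
    return m + math.log(sum(math.exp(lam * (s - m)) for s in scores)) / lam

# Toy per-word relevance scores for one image-sentence pair.
scores = [0.2, 0.4, 0.9]
avg_score = pool_avg(scores)
lse_score = pool_lse(scores)
```

Averaging treats every alignment equally, while LogSumExp behaves like a soft maximum and lets the strongest alignments dominate the final score.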

Key highlights

  • Pre-computed bottom-up attention features for Flickr30K and MS-COCO, available as a Kaggle dataset
  • Four model variants with documented hyperparameters: t-i LSE, t-i AVG, i-t LSE, i-t AVG
  • Built on PyTorch 0.3 (yes, that old) with Python 2.7
  • Includes evaluation script with 5-fold cross-validation support for MS-COCO
  • Apache 2.0 licensed

Caveats

  • Dependencies are frozen in 2018: PyTorch 0.3 and Python 2.7 will require environment archaeology to run today

Verdict

Worth a look if you’re researching cross-modal retrieval or need a baseline for image-text matching with explicit attention mechanisms. Skip it if you need something that runs out of the box on modern PyTorch—you’ll be porting code before you get results.
