Shine a flashlight on your CNN's reasoning
Grad-CAM generates visual heatmaps showing which image regions convinced a neural network to say "cat" instead of "dog."

What it does
Grad-CAM produces coarse localization maps highlighting the important regions in an image for predicting a concept, essentially asking a CNN “where are you looking?” The repo bundles three Torch/Lua scripts: one for image classification, one for visual question answering, and one for image captioning. Feed it an image and a target label (or VQA answer, or caption), and it spits out a heatmap overlay.
The interesting bit
The technique works without retraining or architectural surgery: it runs on existing pretrained Caffe models (VGG, AlexNet) and needs only the gradient of the target score with respect to the feature maps of a chosen convolutional layer. The VQA and captioning demos are the real flex: you can force the model to explain why it answered “green” versus “yellow” for the same fire-hydrant image, revealing how fragile or context-dependent its “reasoning” is.
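To make the "gradient flow" idea concrete, here is a minimal sketch of the core Grad-CAM computation in plain Python. It is not the repo's Lua code. It assumes you have already pulled two things out of the network for one convolutional layer: the feature maps and the gradient of the target score with respect to them. The function name `grad_cam`, the toy values, and the final scaling to [0, 1] are illustrative assumptions.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a coarse Grad-CAM heatmap.

    activations: K feature maps from one conv layer, each H x W (nested lists).
    gradients:   gradient of the target score w.r.t. each activation, same shape.
    """
    num_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global-average-pool each gradient map.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of the feature maps.
    heatmap = [[0.0] * width for _ in range(height)]
    for k in range(num_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weights[k] * activations[k][i][j]

    # ReLU: keep only regions that push the target score up.
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]

    # Assumed convenience step: scale to [0, 1] so the map is easy to overlay.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Toy example: two 2x2 feature maps with made-up values.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # averages to a weight of 0.5
    [[-1.0, -1.0], [-1.0, -1.0]],  # averages to a weight of -1.0
]
heatmap = grad_cam(activations, gradients)
```

In the toy example the second feature map gets a negative weight, so the locations where only it fires are zeroed out by the ReLU. The map comes out at the conv layer's resolution, which is why the result is coarse: it still has to be upsampled to image size before it can be overlaid.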
Key highlights
- Classification, VQA, and captioning pipelines in one repo
- Uses pretrained Caffe models; no fine-tuning required
- GPU/CPU toggle, heatmap or raw output, layer selection all exposed as CLI flags
- BSD license; submodules pull in VQA_LSTM_CNN and neuraltalk2 dependencies
- Live demo at gradcam.cloudcv.org if you want to skip the Lua toolchain
Caveats
- Built for Torch7 and Caffe — a 2017-era stack that now feels archaeological
- Requires manual submodule init and model downloads; setup is not one-command
- Layer names default to VGG-specific values (relu5_3, relu5_4), so using other architectures means reading the source
Verdict
Grab this if you’re doing interpretability research on legacy models or need a reproducible baseline for Grad-CAM citations. Skip it if you want modern PyTorch/TensorFlow implementations — those exist elsewhere and won’t make you wrangle .caffemodel files.