Teaching neural nets to look before they speak
A TensorFlow 1.2 implementation that makes image captioning models point at what they're talking about.

What it does
This repo implements “Show, Attend and Tell,” a 2015 paper that generates image captions using visual attention. The model feeds MSCOCO images through VGGNet19, then produces captions word-by-word while shifting its “gaze” to relevant image regions. You get both the sentence and a heatmap of where the model looked for each word.
The interesting bit
The attention visualization is the payoff: you can literally watch the model focus on “landing gear” versus “sky” as it builds a sentence. It’s a rare case where interpretability comes free with the architecture rather than being bolted on afterward.
Key highlights
- Complete pipeline: data download, preprocessing, training, and evaluation via Jupyter notebook
- TensorBoard logging included for real-time loss monitoring
- Preprocessing script generates image feature vectors up front, so training avoids repeated VGG forward passes
- Evaluation notebook generates captions, attention visualizations, and model metrics in one shot
- Python 2.7 and TensorFlow 1.2 — very much of its era
Caveats
- Stuck on TensorFlow 1.2 and Python 2.7; you’ll need to resurrect a 2016 environment or port it yourself
- README warns that MSCOCO download “may take several hours” — plan accordingly
- No pretrained model weights provided; you train from scratch
Verdict
Worth a look if you’re studying attention mechanisms or need a pedagogical baseline for image captioning. Skip it if you want a production-ready model; modern alternatives (CLIP, BLIP) run circles around this on every axis except interpretability.