OCR with a wandering eye: attention maps show where the model looks
A 2016 TensorFlow implementation that teaches a CNN-LSTM to read text by letting it peek at one character at a time.

What it does Feed it an image of a word; it resizes to height 32, runs a sliding CNN, stacks an LSTM on top, then uses an attention decoder to emit characters one by one. The output is the predicted text string, plus optional heatmaps showing which slice of the image the model stared at for each letter.
The interesting bit The attention visualization is the payoff: you get per-character attention maps overlaid on the original image, so you can literally watch the model’s eye move left-to-right (or jump around) as it decodes. It’s borrowed from the same Harvard NLP group that turned math images into LaTeX.
Key highlights
- End-to-end trainable: CNN feature extractor + LSTM encoder + attention decoder in one stack
- Ships with a pre-trained model on Synth 90K and evaluation data for ICDAR/IIIT5K/SVT benchmarks
--visualizeflag dumps attention overlays toresults/correctso you can debug misreads- Vocabulary is hardcoded to 39 tokens: digits, lowercase letters, plus padding/GO/EOS
- Built on TensorFlow 0.12.1 with Keras handling the conv layers
Caveats
- TensorFlow 0.12.1 is ancient; expect dependency archaeology to get it running
- Training “takes quite a long time to reach convergence” because the CNN and attention train jointly
- Model structure isn’t saved, only parameters—reconstruct the graph yourself or use their launcher
Verdict Worth a look if you need interpretable OCR or want to understand attention mechanisms in the wild. Skip it if you need production OCR today; modern transformers and cloud APIs have lapped this.