Teaching neural nets to reverse-engineer LaTeX from screenshots
A 2016 Harvard NLP project that treats markup recovery as an image-to-sequence translation problem, complete with attention heatmaps.

What it does
im2markup takes a rendered image—say, a screenshot of a LaTeX formula or a web page—and tries to spit back the source markup that generated it. Think of it as OCR for structure: not just reading the characters, but recovering the \frac, the ^, and the nested braces.
The interesting bit
The model marries a CNN for visual feature extraction with an attention-based sequence decoder, the kind of architecture that was still novel in 2016. The attention mechanism has a nice side effect: you get an explicit alignment between each generated token and the region of the image it came from, visualized as a heatmap over the original formula.
Key highlights
- Two main demos: math-to-LaTeX and web-page-to-HTML, each with pretrained models available for download
- Evaluation goes beyond text metrics (BLEU, edit distance) to image-level accuracy: it re-renders the predicted markup and compares pixel-by-pixel
- Built in Torch/Lua, with Python handling preprocessing and evaluation; the README includes a full pipeline from raw images to scored predictions
- GPU-only: the CNN path hard-depends on cuDNN
- Ships with a 100k-formula dataset and a toy sample for quick experiments
Caveats
- The stack is a museum piece: Torch, Lua, and dependencies like tds and nngraph that have been largely superseded by PyTorch
- Preprocessing is finicky and brittle: the KaTeX parser throws errors on some formulas, and the pipeline filters out “too large” images and formulas with “grammar errors” without defining those thresholds
- No support for CPU inference; you’ll need a CUDA setup from the mid-2010s or patience with legacy installs
Verdict
Worth studying if you’re writing a literature review on visual decompilation or attention-based OCR, or if you need to recover LaTeX from a corpus of formula images. Skip it if you want something production-ready; modern multimodal models handle this task with less archaeology.