harvardnlp/im2markup

Teaching neural nets to reverse-engineer LaTeX from screenshots

A 2016 Harvard NLP project that treats markup recovery as an image-to-sequence translation problem, complete with attention heatmaps.

1.3k stars · Lua · Computer Vision

What it does

im2markup takes a rendered image—say, a screenshot of a LaTeX formula or a web page—and tries to spit back the source markup that generated it. Think of it as OCR for structure: not just reading the characters, but recovering the \frac, the ^, and the nested braces.

The interesting bit

The model marries a CNN for visual feature extraction with an attention-based sequence decoder, the kind of architecture that was still novel in 2016. The attention mechanism has a nice side effect: you get an explicit alignment between each generated token and the region of the image it came from, visualized as a heatmap over the original formula.
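
To make the heatmap idea concrete, here is a minimal sketch of one attention step in plain Python. It is not the repository's Torch code: the dot-product scoring and the names attention_heatmap and context_vector are assumptions chosen for illustration, and the project's actual scoring function may differ. What it shows is why the alignment comes for free: the weights the decoder computes over the CNN's feature grid already form a map over image regions.

```python
import math

def attention_heatmap(features, query):
    """Return an H x W grid of attention weights that sums to 1.

    features: H x W grid of CNN feature vectors (lists of floats).
    query:    the decoder state for the token being generated.
    Dot-product scoring is an illustrative assumption, not the repo's method.
    """
    scores = [[sum(f * q for f, q in zip(cell, query)) for cell in row]
              for row in features]
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def context_vector(features, weights):
    """Weighted sum of the feature vectors: what the decoder 'sees' for this token."""
    dim = len(features[0][0])
    ctx = [0.0] * dim
    for feature_row, weight_row in zip(features, weights):
        for cell, w in zip(feature_row, weight_row):
            for k in range(dim):
                ctx[k] += w * cell[k]
    return ctx
```

Calling attention_heatmap once per generated token and upsampling each weight grid to the input resolution would give one heatmap per token, which is roughly what the visualization overlays on the formula.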

Key highlights

  • Two main demos: math-to-LaTeX and web-page-to-HTML, each with pretrained models available for download
  • Evaluation goes beyond text metrics (BLEU, edit distance) to image-level accuracy: it re-renders the predicted markup and compares the result with the original image pixel by pixel
  • Built in Torch/Lua, with Python handling preprocessing and evaluation; the README includes a full pipeline from raw images to scored predictions
  • GPU-only: the CNN path hard-depends on cuDNN
  • Ships with a 100k-formula dataset and a toy sample for quick experiments
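
The two kinds of evaluation mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's evaluation scripts. The function names, the normalization in edit_accuracy, and the representation of images as nested lists of pixel values are all hypothetical. Rendering the predicted markup to an image is assumed to happen elsewhere.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # drop a reference token
                           cur[j - 1] + 1,            # insert a hypothesis token
                           prev[j - 1] + (r != h)))   # substitute, or match at no cost
        prev = cur
    return prev[-1]

def edit_accuracy(ref, hyp):
    """1 minus the normalized edit distance (this normalization is an assumption)."""
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), len(hyp), 1)

def images_match(rendered_pred, rendered_ref):
    """Image-level exact match: same size and every pixel identical."""
    return rendered_pred == rendered_ref

# Usage: one wrong token costs one edit at the text level.
reference = r"\frac { a } { b }".split()
predicted = r"\frac { a } { c }".split()
distance = edit_distance(reference, predicted)
```

The point of the image-level check is that two different markup strings can render identically, for example with redundant braces, so a prediction can lose points on text metrics and still be visually correct.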

Caveats

  • The stack is a museum piece: Torch on Lua, plus dependencies like tds and nngraph, all largely superseded by PyTorch
  • Preprocessing is finicky and brittle: the KaTeX parser throws errors on some formulas, and the pipeline filters out “too large” images and formulas with “grammar errors” without defining those thresholds
  • No support for CPU inference; you’ll need a CUDA setup from the mid-2010s or patience with legacy installs

Verdict

Worth studying if you’re writing a literature review on visual decompilation or attention-based OCR, or if you need to recover LaTeX from a corpus of formula images. Skip it if you want something production-ready; modern multimodal models handle this task with less archaeology.
