Is im2markup open source?

Yes — harvardnlp/im2markup is open source, released under the MIT license.

What language is im2markup written in?

harvardnlp/im2markup is primarily written in Lua.

How popular is im2markup?

harvardnlp/im2markup has 1.3k stars on GitHub.

Where can I find im2markup?

harvardnlp/im2markup is on GitHub at https://github.com/harvardnlp/im2markup.

← all repositories

harvardnlp/im2markup

Teaching neural nets to reverse-engineer LaTeX from screenshots

A 2016 Harvard NLP project that treats markup recovery as an image-to-sequence translation problem, complete with attention heatmaps.

★1.3k stars Lua Computer Vision

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

im2markup takes a rendered image—say, a screenshot of a LaTeX formula or a web page—and tries to spit back the source markup that generated it. Think of it as OCR for structure: not just reading the characters, but recovering the \frac, the ^, and the nested braces.

The interesting bit

The model marries a CNN for visual feature extraction with an attention-based sequence decoder, the kind of architecture that was still novel in 2016. The attention mechanism has a nice side effect: you get an explicit alignment between each generated token and the region of the image it came from, visualized as a heatmap over the original formula.

Key highlights

Two main demos: math-to-LaTeX and web-page-to-HTML, each with pretrained models available for download
Evaluation goes beyond text metrics (BLEU, edit distance) to image-level accuracy: it re-renders the predicted markup and compares pixel-by-pixel
Built in Torch/Lua, with Python handling preprocessing and evaluation; the README includes a full pipeline from raw images to scored predictions
GPU-only: the CNN path hard-depends on cuDNN
Ships with a 100k-formula dataset and a toy sample for quick experiments

Caveats

The stack is a museum piece: Torch, Lua, and dependencies like tds and nngraph that have been largely superseded by PyTorch
Preprocessing is finicky and brittle: the KaTeX parser throws errors on some formulas, and the pipeline filters out “too large” images and formulas with “grammar errors” without defining those thresholds
No support for CPU inference; you’ll need a CUDA setup from the mid-2010s or patience with legacy installs

Verdict

Worth studying if you’re writing a literature review on visual decompilation or attention-based OCR, or if you need to recover LaTeX from a corpus of formula images. Skip it if you want something production-ready; modern multimodal models handle this task with less archaeology.

Frequently asked

What is harvardnlp/im2markup?: A 2016 Harvard NLP project that treats markup recovery as an image-to-sequence translation problem, complete with attention heatmaps.
Is im2markup open source?: Yes — harvardnlp/im2markup is open source, released under the MIT license.
What language is im2markup written in?: harvardnlp/im2markup is primarily written in Lua.
How popular is im2markup?: harvardnlp/im2markup has 1.3k stars on GitHub.
Where can I find im2markup?: harvardnlp/im2markup is on GitHub at https://github.com/harvardnlp/im2markup.