Screenshot an equation, get LaTeX back
A Vision Transformer that reads math formulas from images and spits out typeset-ready code.

What it does
pix2tex takes an image of a mathematical formula—screenshot, photo, or file—and returns the corresponding LaTeX code. It runs as a CLI tool, a GUI snipping utility, a Python API, or a Dockerized Streamlit service. The model checkpoints download automatically on first use.
The interesting bit
The preprocessing step is the quiet hero: a secondary neural network predicts the optimal resolution for your input image, which is then resized to match the training distribution. The README is admirably honest that this isn’t magic (“don’t zoom in all the way before taking a picture”) and suggests retrying at different resolutions if the first prediction looks off.
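The "retry at different resolutions" advice can be sketched in a few lines of Python. This is a minimal illustration, not the project's own preprocessing: the helper names `scaled_sizes` and `recognize_at_scales` and the scale factors are hypothetical, chosen only for this example. The second function assumes Pillow and pix2tex are installed and uses the `LatexOCR` class from `pix2tex.cli`.

```python
from typing import Iterable, List, Tuple


def scaled_sizes(
    width: int,
    height: int,
    scales: Iterable[float] = (0.5, 0.75, 1.0, 1.25, 1.5),
) -> List[Tuple[int, int]]:
    """Return one (width, height) pair per scale factor, never below 1 px."""
    return [
        (max(1, round(width * s)), max(1, round(height * s)))
        for s in scales
    ]


def recognize_at_scales(path: str, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Run the model on several rescaled copies of one image.

    Hypothetical helper: the third-party imports live inside the function
    so that scaled_sizes stays usable without them.
    """
    from PIL import Image
    from pix2tex.cli import LatexOCR

    model = LatexOCR()  # checkpoints download on first use
    img = Image.open(path)
    results = []
    for size in scaled_sizes(img.width, img.height, scales):
        # Each rescaled copy may lead to a different prediction.
        results.append((size, model(img.resize(size))))
    return results
```

Comparing the returned predictions side by side is one way to spot the resolution at which the output stabilizes. The model still applies its own resizing step to each copy, so the manual rescale only changes what that step starts from.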
Key highlights
- Encoder-decoder architecture: ViT with ResNet backbone feeding a Transformer decoder
- Token accuracy of 0.60 and BLEU score of 0.88 on the benchmark dataset
- GUI supports Linux screenshot tools across X11 and Wayland (with manual SCREENSHOT_TOOL override for compositor compatibility)
- Training pipeline included, with data generation via XeLaTeX and KaTeX normalization
- Handwritten formula support marked as “kinda done” in the training notebook
Caveats
- Token accuracy at 0.60 means roughly four in ten tokens are wrong; the README explicitly warns to “always double check the result carefully”
- Beam search, model distillation, and proper tracing are all on the un-checked TODO list
- Dataset class “needs further improving” per the author’s own note
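To make the 0.60 figure concrete, here is a small sketch of a token-accuracy calculation. It assumes a simple position-by-position comparison of predicted and reference tokens; how the project's own evaluation script computes the metric is not covered here, and the function name and the sample formulas are made up for illustration.

```python
def token_accuracy(predicted, reference):
    """Fraction of positions where the two token lists agree.

    Assumption for this sketch: tokens are compared position by position,
    and the longer of the two sequences sets the denominator.
    """
    if not predicted and not reference:
        return 1.0
    total = max(len(predicted), len(reference))
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / total


# Illustrative 10-token formulas; 6 positions agree, 4 differ.
reference = r"\frac { a } { b } + c d".split()
predicted = r"\frac { a } { 6 } - e f".split()

print(token_accuracy(predicted, reference))  # 0.6
```

At this accuracy a ten-token formula comes back with about four tokens to fix, which is why proofreading the output is not optional.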
Verdict
Worth a look if you regularly transcribe equations from papers or slides and can tolerate proofreading the output. Not yet a drop-in replacement for manual typing if precision matters.