Meta's PDF parser that actually reads the math
Nougat turns academic PDFs into structured markdown, including LaTeX equations and tables, using a vision transformer trained on arXiv papers.

What it does
Nougat is a neural PDF-to-markdown converter built specifically for academic documents. Feed it a paper and it spits out .mmd files—lightweight markup with LaTeX math and tables intact. It runs via CLI, Python API, or a local HTTP server on port 8503. Two model sizes exist: 0.1.0-small (default) and 0.1.0-base.
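For scripting the CLI from Python, a minimal sketch could look like the following. It only assembles the command and does not run it. The helper name build_nougat_command is made up for illustration, and the -o (output directory) and -m (model tag) flags are taken from the project's README as I understand it, so check nougat --help before relying on them.

```python
import shlex
from typing import List, Optional

# The two model tags named in the text above.
KNOWN_MODELS = ("0.1.0-small", "0.1.0-base")


def build_nougat_command(
    pdf_path: str,
    out_dir: str,
    model: str = "0.1.0-small",
    pages: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a nougat CLI invocation.

    Hypothetical helper: the flag names -o, -m and -p are assumptions
    based on the project's README, not verified against every release.
    """
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model tag: {model!r}")
    cmd = ["nougat", pdf_path, "-o", out_dir, "-m", model]
    if pages:
        # Page spec is passed through as-is, e.g. "1-4,7".
        cmd += ["-p", pages]
    return cmd


cmd = build_nougat_command("paper.pdf", "out", model="0.1.0-base", pages="1-4,7")
print(shlex.join(cmd))  # prints: nougat paper.pdf -o out -m 0.1.0-base -p 1-4,7
```

To actually run the conversion, hand the list to subprocess.run(cmd, check=True); that needs nougat installed and on your PATH.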
The interesting bit
Instead of treating PDFs as a text extraction problem, Nougat treats them as a vision problem. It builds on the Donut architecture: pure image-to-sequence, with no traditional OCR pipeline. The model was trained on arXiv and PubMed Central papers, so it handles two-column layouts, inline math, and the general chaos of TeX-generated PDFs.
Key highlights
- Outputs Mathpix-compatible markdown with LaTeX tables and equations preserved
- CLI supports batch processing, page ranges (-p 1-4,7), and directory inputs
- Optional API mode (nougat_api) for HTTP POST requests with start/stop page parameters
- Training and fine-tuning pipeline included via train.py and YAML configs
- Dataset generation tools provided, though they require LaTeXML, pdffigures2, and non-trivial setup
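The API mode from the list above can be exercised with nothing but the standard library. This is a sketch under assumptions: the /predict/ endpoint path, the multipart field name file, and the start and stop query parameter names are what I recall from the project's README, and the helper names predict_url and build_request are hypothetical. Nothing is sent over the network here; the code only builds the request.

```python
import urllib.parse
import urllib.request
import uuid
from typing import Optional


def predict_url(
    base: str = "http://127.0.0.1:8503",
    start: Optional[int] = None,
    stop: Optional[int] = None,
) -> str:
    """Build the prediction URL, adding start/stop only when given.

    Assumption: the server exposes /predict/ and reads `start` and
    `stop` as query parameters.
    """
    params = {k: v for k, v in (("start", start), ("stop", stop)) if v is not None}
    query = urllib.parse.urlencode(params)
    return f"{base}/predict/" + (f"?{query}" if query else "")


def build_request(pdf_bytes: bytes, filename: str, url: str) -> urllib.request.Request:
    """Wrap a PDF in a multipart/form-data POST request (not sent here).

    Assumption: the upload field is called "file".
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


url = predict_url(start=1, stop=5)
print(url)  # prints: http://127.0.0.1:8503/predict/?start=1&stop=5
```

With nougat_api running locally, urllib.request.urlopen(build_request(pdf_bytes, "paper.pdf", url)) would send the upload; the response should carry the converted markup, though the exact response shape is worth checking against the server itself.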
Caveats
- English or Latin-based languages only; Chinese, Russian, Japanese, etc. will not work
- Failure detection heuristic misfires on some CPUs/GPUs, producing [MISSING_PAGE]; use --no-skipping if this happens
- Model weights are CC-BY-NC (non-commercial), while the code is MIT
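If you batch-convert many papers, it helps to spot the skipped pages automatically before deciding to rerun with --no-skipping. The sketch below scans .mmd text for the marker. The function names are hypothetical, and the pattern deliberately tolerates a suffix inside the brackets (something like a reason or page number), since the exact marker format is an assumption on my part and may differ between versions.

```python
import re
from typing import List

# Matches "[MISSING_PAGE]" and bracketed variants that start with the same
# prefix. The suffix handling is an assumption, not a documented format.
MISSING_PAGE = re.compile(r"\[MISSING_PAGE[^\]]*\]")


def find_missing_pages(mmd_text: str) -> List[str]:
    """Return every missing-page marker found in the converted text."""
    return MISSING_PAGE.findall(mmd_text)


def should_retry_without_skipping(mmd_text: str, max_missing: int = 0) -> bool:
    """Suggest a rerun with --no-skipping when too many pages were dropped."""
    return len(find_missing_pages(mmd_text)) > max_missing


sample = "# Title\n\nSome text.\n\n[MISSING_PAGE]\n\nMore text with $x^2$."
print(find_missing_pages(sample))             # prints: ['[MISSING_PAGE]']
print(should_retry_without_skipping(sample))  # prints: True
```

A rerun with --no-skipping trades the marker for whatever the model produced on that page, which may be repetitive or garbled, so it is worth eyeballing those pages afterwards.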
Verdict
Researchers building RAG pipelines, citation tools, or anything that needs structured text from PDFs should try this. If your documents aren’t academic papers or you need commercial use of the weights, look elsewhere.