A 7B model that reads PDFs like a human, not a copy machine
AllenAI's olmOCR turns scanned documents into clean Markdown by treating layout as a vision problem, not a pipeline of heuristics.

What it does
olmOCR converts PDFs, PNGs, and JPEGs into linear Markdown text. It handles equations, tables, handwriting, multi-column layouts, and figures while stripping headers and footers. The output preserves natural reading order rather than raw top-to-bottom text extraction.
The interesting bit
Instead of chaining classical OCR with layout analysis, olmOCR feeds document images through a 7B vision-language model trained to emit structured text directly. The project also ships olmOCR-Bench — 7,000 test cases across 1,400 documents — which makes the “works on my PDF” problem measurable rather than anecdotal.
Key highlights
- Local GPU inference (RTX 4090 or better, 12GB+ VRAM) or remote via any OpenAI-compatible API endpoint
- Remote install avoids PyTorch entirely (~2GB+ saved); GPU install needs 30GB disk space
- Docker images available; switches from SGLang to vLLM inference as of v0.1.75
- Claims sub-$200/million pages cost at external providers; FP8 quantization for faster inference
- Trainer code included if you want to fine-tune the VLM on your own document distributions
Caveats
- Requires a clean conda environment; README warns against installing into existing Python environments due to dependency conflicts
- “Recent NVIDIA GPU” is a hard requirement for local use — no CPU fallback mentioned
- Benchmark table shows olmOCR v0.4.0 trailing Chandra OCR 0.1.0 and Infinity-Parser 7B on overall score (82.4 vs. 83.1 and 82.5), though the gap is within error margins
Verdict
Grab this if you’re building LLM training pipelines and need to extract usable text from messy academic papers, scanned books, or complex reports. Skip it if your PDFs are already born-digital with embedded text — pdftotext is still free.