← all repositories
allenai/olmocr

A 7B model that reads PDFs like a human, not a copy machine

AllenAI's olmOCR turns scanned documents into clean Markdown by treating layout as a vision problem, not a pipeline of heuristics.

17.4k stars Python Data Tooling
olmocr
Velocity · 7d
+28
★ / day
Trend
steady
star history

What it does

olmOCR converts PDFs, PNGs, and JPEGs into linear Markdown text. It handles equations, tables, handwriting, multi-column layouts, and figures while stripping headers and footers. The output preserves natural reading order rather than raw top-to-bottom text extraction.

The interesting bit

Instead of chaining classical OCR with layout analysis, olmOCR feeds document images through a 7B vision-language model trained to emit structured text directly. The project also ships olmOCR-Bench — 7,000 test cases across 1,400 documents — which makes the “works on my PDF” problem measurable rather than anecdotal.

Key highlights

  • Local GPU inference (RTX 4090 or better, 12GB+ VRAM) or remote via any OpenAI-compatible API endpoint
  • Remote install avoids PyTorch entirely (~2GB+ saved); GPU install needs 30GB disk space
  • Docker images available; switches from SGLang to vLLM inference as of v0.1.75
  • Claims sub-$200/million pages cost at external providers; FP8 quantization for faster inference
  • Trainer code included if you want to fine-tune the VLM on your own document distributions

Caveats

  • Requires a clean conda environment; README warns against installing into existing Python environments due to dependency conflicts
  • “Recent NVIDIA GPU” is a hard requirement for local use — no CPU fallback mentioned
  • Benchmark table shows olmOCR v0.4.0 trailing Chandra OCR 0.1.0 and Infinity-Parser 7B on overall score (82.4 vs. 83.1 and 82.5), though the gap is within error margins

Verdict

Grab this if you’re building LLM training pipelines and need to extract usable text from messy academic papers, scanned books, or complex reports. Skip it if your PDFs are already born-digital with embedded text — pdftotext is still free.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.