A 3B model that actually reads your PDFs, not just scrapes them
OCRFlux converts messy PDFs and images into clean Markdown by treating layout as a visual reasoning problem, not a pipeline of brittle heuristics.

What it does
OCRFlux is a Python toolkit that turns PDFs and images into structured Markdown. It handles multi-column layouts, figures, equations, and tables, then stitches together paragraphs and tables that span page breaks. The authors claim this cross-page merging is the first open-source implementation of its kind.
Under the hood it runs a 3B-parameter vision-language model through vLLM, so inference needs a recent NVIDIA GPU with at least 12 GB of VRAM.
The interesting bit
Most OCR tools treat each page as an isolated image and pray the layout is simple. OCRFlux’s unusual angle is explicitly modeling cross-page structure: it detects when a table or paragraph continues on the next page, then merges fragments even when headers repeat or cells split mid-row. The README documents genuinely gnarly cases—vertical table splits, multi-line cells broken across pages—that most pipelines simply garble.
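To make "merging fragments when headers repeat" concrete, here is a toy sketch of the easiest case: a Markdown table whose continuation on the next page repeats the header row. This is not how OCRFlux does it (the project uses its vision-language model for both detection and merging), and the function name `merge_table_fragments` is hypothetical. It only illustrates what the merged output should look like.

```python
def merge_table_fragments(first: str, second: str) -> str:
    """Join two Markdown table fragments, dropping a repeated header.

    Hypothetical helper for illustration only; a real system must also
    handle split cells, vertical splits, and headers that differ slightly.
    """
    first_rows = [line.strip() for line in first.strip().splitlines()]
    second_rows = [line.strip() for line in second.strip().splitlines()]
    header = first_rows[0]

    # If the continuation starts with the same header, skip it.
    if second_rows and second_rows[0] == header:
        second_rows = second_rows[1:]
        # Also skip the separator row (only pipes, dashes, colons, spaces).
        if second_rows and set(second_rows[0]) <= set("|-: "):
            second_rows = second_rows[1:]

    return "\n".join(first_rows + second_rows)


page_1 = """
| Year | Revenue |
|------|---------|
| 2022 | 10.1 |
"""
page_2 = """
| Year | Revenue |
|------|---------|
| 2023 | 12.4 |
"""

merged = merge_table_fragments(page_1, page_2)
print(merged)
```

A string-matching rule like this breaks as soon as a cell is split mid-row or the table is cut vertically, which is exactly the class of cases the README highlights.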
Key highlights
- 3B-parameter VLM runs on an RTX 3090; no 70B model required
- Benchmarks against olmOCR-7B, Nanonets-OCR-s, and MonkeyOCR on manually labeled English and Chinese data
- Claims 0.967 average Edit Distance Similarity on single-page parsing versus 0.872 for olmOCR-7B
- Cross-page table/paragraph detection scores 0.986 F1 on held-out test data
- Ships four evaluation datasets on Hugging Face, including a 9K-sample table-merging benchmark
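For readers unfamiliar with the Edit Distance Similarity figure quoted above, the sketch below shows one common way such a score is computed: one minus the Levenshtein distance divided by the length of the longer string. That normalization is an assumption; the benchmark's exact definition may differ, and the function names here are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_distance_similarity(predicted: str, reference: str) -> float:
    """Assumed definition: 1 - distance / max(len); 1.0 means identical."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(predicted, reference) / longest


score = edit_distance_similarity("| Year | Revenue |", "| Year | Revenue |")
print(score)
```

Under this definition, a score of 0.967 would mean that roughly 3% of the characters need editing to turn the model's output into the reference, averaged over the test pages.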
Caveats
- Complex tables (those using rowspan/colspan) score lower than simple ones: 0.807 vs. 0.912 TEDS on the PubTabNet-derived benchmark, and OCRFlux trails MonkeyOCR on that split
- Installation is finicky: requires poppler-utils, specific Microsoft and Crosextra fonts, and a clean conda environment; the README warns against installing into existing Python environments
- Launched only in June 2025; the long-term maintenance trajectory is unclear
Verdict
Worth a look if you regularly ingest academic papers, financial reports, or scanned documents where table continuity matters. Skip it if you need CPU-only inference or your PDFs are already clean single-page images.