A PDF parser that actually cares about reading order
Open-source tool extracts structured data from PDFs and auto-tags them for accessibility, backed by benchmark claims and PDF Association collaboration.

What it does
OpenDataLoader PDF turns PDFs into structured Markdown, JSON with bounding boxes, or HTML. It also auto-tags untagged PDFs into Tagged PDFs, the foundation for screen-reader accessibility. There’s a fast deterministic mode for simple documents and a “hybrid” AI-backed mode for scanned pages, complex tables, formulas, and charts. SDKs exist for Python, Node.js, and Java, though the core engine runs on the JVM.
The interesting bit
The project claims to be the first open-source end-to-end tool for auto-tagging PDFs to the Well-Tagged PDF specification, validated with veraPDF in collaboration with the PDF Association and Dual Lab. That’s the unglamorous infrastructure work that usually costs $50–200 per document in manual remediation.
Key highlights
- Benchmark claims: #1 overall extraction accuracy (0.907) in hybrid mode, 0.928 table accuracy, per their own benchmark suite across 200 real-world PDFs
- Free tier covers data extraction, layout analysis, and auto-tagging to Tagged PDF under Apache 2.0
- Enterprise add-on provides PDF/UA-1 and PDF/UA-2 export and a visual accessibility studio
- Hybrid mode requires running a local server (`opendataloader-pdf-hybrid`) alongside the main client
- Each `convert()` call spawns a JVM process, so batch your files in one call or pay the startup cost repeatedly
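The batching advice in the last highlight can be sketched as follows. This is a minimal illustration, not the SDK's documented usage: the helper names `collect_pdfs` and `convert_in_one_call` are hypothetical, and the keyword arguments `input_path` and `output_dir` are assumptions about how `convert()` accepts a list of files. The converter is passed in as a parameter so the sketch runs without the SDK or a JVM installed.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def collect_pdfs(roots: Iterable[str]) -> List[str]:
    """Expand files and folders into a sorted, de-duplicated list of PDF paths."""
    found = set()
    for root in roots:
        p = Path(root)
        if p.is_dir():
            # Walk the folder recursively; the pattern is case-sensitive on most systems.
            found.update(str(f) for f in p.rglob("*.pdf"))
        elif p.suffix.lower() == ".pdf":
            # Explicit file paths are passed through without checking that they exist.
            found.add(str(p))
    return sorted(found)


def convert_in_one_call(
    roots: Iterable[str],
    output_dir: str,
    convert: Callable[..., None],
) -> int:
    """Hand every PDF to a single convert() call so the JVM starts only once.

    `convert` stands in for the SDK's convert(); the keyword names used here
    (`input_path`, `output_dir`) are assumptions, not confirmed signatures.
    """
    pdfs = collect_pdfs(roots)
    if pdfs:
        convert(input_path=pdfs, output_dir=output_dir)
    return len(pdfs)
```

The point of the design is the single `convert(...)` invocation: looping over files and calling it once per file would pay the JVM startup cost each time.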
Caveats
- The benchmark table shows their own local mode is about 30 times faster (0.015 s/page versus 0.463 s/page) but drops to 0.831 overall accuracy; the #1 score of 0.907 requires hybrid mode
- “Enterprise” features (PDF/UA export, accessibility studio) are proprietary add-ons; the open-source boundary stops at Tagged PDF generation
- No Word, Excel, or PowerPoint support — PDFs only
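To put the speed and accuracy trade-off from the first caveat in numbers, here is a back-of-the-envelope sketch. The per-page times are the project's own benchmark figures quoted above; the 10,000-page workload is a hypothetical example, and the estimate assumes strictly sequential processing with no JVM startup or server overhead.

```python
# Per-page timings quoted from the project's own benchmark table.
LOCAL_S_PER_PAGE = 0.015
HYBRID_S_PER_PAGE = 0.463


def total_seconds(pages: int, seconds_per_page: float) -> float:
    """Estimated wall-clock time, assuming pages are processed one after another."""
    return pages * seconds_per_page


def slowdown() -> float:
    """How many times slower hybrid mode is than local mode, per page."""
    return HYBRID_S_PER_PAGE / LOCAL_S_PER_PAGE


pages = 10_000  # hypothetical workload, not a figure from the benchmark
local = total_seconds(pages, LOCAL_S_PER_PAGE)
hybrid = total_seconds(pages, HYBRID_S_PER_PAGE)

print(f"local:  {local / 60:.1f} min")
print(f"hybrid: {hybrid / 60:.1f} min")
print(f"hybrid is about {slowdown():.0f}x slower per page")
```

On those assumptions the local run finishes in a few minutes while the hybrid run takes well over an hour, so the accuracy gain from 0.831 to 0.907 is what you are buying with that time.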
Verdict
Worth evaluating if you’re building RAG pipelines or need to batch-process PDFs for accessibility compliance. Skip it if you need hybrid-mode accuracy at under 100 ms per page (only the less accurate local mode is that fast) or a pure-Python stack without JVM baggage.