IBM's 60k-star PDF parser that speaks LLM
Docling turns chaotic office documents into structured, AI-ready formats without sending your data to the cloud.

What it does
Docling ingests PDFs, Word docs, PowerPoints, Excel sheets, images, audio, and even LaTeX, then exports clean structured output: Markdown, JSON, HTML, or Docling's own “DocTags” format. It runs entirely locally, which matters when your documents contain things you wouldn’t paste into ChatGPT.
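As a rough illustration of that ingest-then-export flow, here is a minimal Python sketch. `DocumentConverter` and `export_to_markdown()` come from Docling's Python package; the `convert_to_markdown` helper and its optional `converter` parameter are hypothetical names added here so the function can be exercised without Docling installed.

```python
"""Minimal sketch: convert one document to Markdown with Docling's Python API."""


def convert_to_markdown(source, converter=None):
    """Return the Markdown export of `source` (a local path or URL).

    `converter` (hypothetical parameter) is any object with a `convert(source)`
    method whose result exposes `.document.export_to_markdown()`. When omitted,
    a Docling DocumentConverter is created (assumes `pip install docling`).
    """
    if converter is None:
        # Imported lazily so this module loads even when Docling is absent.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


if __name__ == "__main__":
    # The arXiv PDF used in the CLI one-liner from the highlights.
    print(convert_to_markdown("https://arxiv.org/pdf/2206.01062"))
```

Passing the converter in keeps one instance reusable across many documents, which is likely what you want in a batch job.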
The interesting bit
The project treats document parsing as an AI infrastructure problem, not a file-conversion chore. It bundles layout analysis, reading-order detection, table reconstruction, OCR, and even chart understanding (bar charts to tables, pie charts to descriptions) into a single pipeline. The new default “Heron” layout model speeds up PDF parsing, and there’s a built-in MCP server so agents can call it directly.
Key highlights
- 60k+ GitHub stars; originated at IBM Research Zurich, now under the Linux Foundation’s AI & Data umbrella
- One-liner CLI: `docling https://arxiv.org/pdf/2206.01062` spits out structured Markdown
- Native integrations with LangChain, LlamaIndex, Crew AI, and Haystack
- Supports visual language models including IBM’s own GraniteDocling for tricky layouts
- Handles niche formats: USPTO patents, JATS academic articles, XBRL financial reports, WebVTT transcripts
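If you would rather drive the CLI one-liner from a script, a small wrapper along these lines might do. It uses only the bare `docling <source>` form quoted in the highlights; `build_docling_command` and `run_docling` are hypothetical helper names, and the wrapper assumes the `docling` executable is on your PATH.

```python
import shutil
import subprocess


def build_docling_command(source):
    """Build the argument list for the one-liner `docling <source>`."""
    return ["docling", source]


def run_docling(source):
    """Run the CLI if it is installed.

    Returns the process exit code, or None when no `docling` executable
    is found on PATH (for example, when Docling is not installed).
    """
    if shutil.which("docling") is None:
        return None
    completed = subprocess.run(build_docling_command(source), check=False)
    return completed.returncode


if __name__ == "__main__":
    print(run_docling("https://arxiv.org/pdf/2206.01062"))
```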
Caveats
- Python 3.9 support was dropped in v2.70.0; requires 3.10+
- Structured information extraction is marked beta
- Some advanced features (metadata extraction, molecular structure parsing) are listed as “coming soon” with no timeline given
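Given the Python version caveat, a script that depends on a recent Docling release could fail fast on older interpreters. This is a minimal sketch; `MIN_PYTHON` and `python_is_supported` are hypothetical names, not part of Docling.

```python
import sys

# Assumption taken from the caveat above: recent Docling needs Python 3.10+.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not python_is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise SystemExit(f"Python 3.10 or newer is required, found {found}")
```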
Verdict
Worth a look if you’re building RAG pipelines or agentic workflows and are tired of explaining to your LLM why the table on page 47 of a PDF is actually three tables. Overkill if you just need pdftotext.