docling-project/docling

IBM's 60k-star PDF parser that speaks LLM

Docling turns chaotic office documents into structured, AI-ready formats without sending your data to the cloud.

61.1k stars · Python · Data Tooling · RAG · Search
Velocity (7d): +87 ★/day · Trend: steady

What it does

Docling ingests PDFs, Word docs, PowerPoints, Excel sheets, images, audio, and even LaTeX, then exports clean structured output: Markdown, JSON, HTML, or Docling's own “DocTags” format. It runs entirely locally, which matters when your documents contain things you wouldn’t paste into ChatGPT.
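For a feel of the convert-then-export flow described above, here is a minimal Python sketch. It assumes the `docling` package is installed and relies on the `DocumentConverter` class plus the `export_to_markdown()` and `export_to_dict()` methods shown in Docling's own usage examples. The helper names `convert` and `export_all` are made up for illustration.

```python
import json


def convert(source):
    """Convert a local path or URL and return the parsed document.

    The import is deferred so this module still loads where docling
    is missing (assumption: `pip install docling` has been run before
    this function is actually called).
    """
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(source).document


def export_all(document):
    """Render a converted document as Markdown and as a JSON string."""
    return {
        "markdown": document.export_to_markdown(),
        "json": json.dumps(document.export_to_dict(), indent=2),
    }
```

Calling `export_all(convert("https://arxiv.org/pdf/2206.01062"))` should run the whole flow on your own machine. The first run typically downloads model weights, so expect a wait.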

The interesting bit

The project treats document parsing as an AI infrastructure problem, not a file-conversion chore. It bundles layout analysis, reading-order detection, table reconstruction, OCR, and even chart understanding (bar charts to tables, pie charts to descriptions) into a single pipeline. The new default “Heron” layout model speeds up PDF parsing, and there’s a built-in MCP server so agents can call it directly.
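Some of the stages bundled into that pipeline can be toggled from Python. A rough sketch, assuming `docling` is installed and that `PdfPipelineOptions` exposes `do_ocr` and `do_table_structure` as in the project's published examples. `build_pdf_converter` is a hypothetical helper name, and option names may shift between releases.

```python
def build_pdf_converter(do_ocr=True, do_table_structure=True):
    """Return a DocumentConverter with selected PDF stages switched on or off."""
    # Deferred imports: docling is only needed once the function is called.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = do_ocr                          # OCR for scanned pages
    options.do_table_structure = do_table_structure  # table reconstruction

    # Apply these options to PDF inputs only; other formats keep defaults.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options),
        }
    )
```

Turning OCR off, for example with `build_pdf_converter(do_ocr=False)`, is a plausible choice when the PDFs already carry a text layer and you want faster runs.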

Key highlights

  • 60k+ GitHub stars; originated at IBM Research Zurich, now under the Linux Foundation’s AI & Data umbrella
  • One-liner CLI: docling https://arxiv.org/pdf/2206.01062 spits out structured Markdown
  • Native integrations with LangChain, LlamaIndex, Crew AI, and Haystack
  • Supports visual language models including IBM’s own GraniteDocling for tricky layouts
  • Handles niche formats: USPTO patents, JATS academic articles, XBRL financial reports, WebVTT transcripts

Caveats

  • Python 3.9 support was dropped in v2.70.0; requires 3.10+
  • Structured information extraction is marked beta
  • Some advanced features (metadata extraction, molecular structure parsing) are listed as “coming soon” with no timeline given

Verdict

Worth a look if you’re building RAG pipelines or agentic workflows and are tired of explaining to your LLM why the table on page 47 of a PDF is actually three tables. Overkill if you just need pdftotext.
