Is pymupdf4llm open source?

Yes — pymupdf/pymupdf4llm is open source, released under the AGPL-3.0 license.

What language is pymupdf4llm written in?

pymupdf/pymupdf4llm is primarily written in Python.

How popular is pymupdf4llm?

pymupdf/pymupdf4llm has 2k stars on GitHub.

Where can I find pymupdf4llm?

pymupdf/pymupdf4llm is on GitHub at https://github.com/pymupdf/pymupdf4llm.


pymupdf/pymupdf4llm

PDF-to-LLM pipeline that actually reads the room

A thin Python wrapper around PyMuPDF that turns documents into structured Markdown, JSON, or plain text—layout-aware, with selective OCR that skips clean pages.

★ 2k stars · Python · Data Tooling




What it does

PyMuPDF4LLM is a convenience layer over PyMuPDF, the Python binding for the MuPDF C engine. Feed it a PDF (or Office documents via the paid Pro add-on) and it emits Markdown, JSON, or plain text with reading order reconstructed, tables detected, and heading hierarchy preserved. The target audience is obvious: RAG pipelines, vector stores, and anything else that needs LLM-ready text without calling a cloud API.
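As a rough illustration of that workflow, here is a minimal sketch of a format dispatcher. The function names `to_markdown`, `to_json`, and `to_text` come from the project's listed highlights; the assumption that all three accept a file path as their first argument, along with the `convert` helper and its `backend` parameter, is ours and should be checked against the library's documentation.

```python
import importlib

# Output format -> converter function name, as listed in the highlights.
CONVERTERS = {"markdown": "to_markdown", "json": "to_json", "text": "to_text"}


def convert(pdf_path, fmt="markdown", backend=None):
    """Run the converter that matches `fmt` on `pdf_path`.

    `convert` and `backend` are hypothetical names for this sketch.
    `backend` defaults to the pymupdf4llm module; any object exposing the
    same three functions can be passed instead, which keeps this testable
    without the package installed.
    """
    if fmt not in CONVERTERS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if backend is None:
        # Imported lazily so the ValueError path works without the package.
        backend = importlib.import_module("pymupdf4llm")
    converter = getattr(backend, CONVERTERS[fmt])
    # Assumption: each converter takes the document path as its first argument.
    return converter(str(pdf_path))
```

With the package installed, `convert("report.pdf")` (a hypothetical file) would be expected to return the document as a Markdown string, ready to write to disk or hand to a splitter.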

The interesting bit

The hybrid OCR strategy is the standout. Instead of blanket-OCRing every page or ignoring scanned regions entirely, it inspects each page first—checking for illegible characters, vector graphics masquerading as text, image-covered areas, and existing OCR layers—then applies OCR only where needed. The README claims this cuts OCR time by roughly 50% versus full-document approaches, and avoids degrading clean digital text with recognition errors. It auto-selects between Tesseract and rapidocr_onnxruntime at runtime, or accepts a custom OCR function.
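To make the idea concrete, the per-page decision could look something like the sketch below. This is not the library's implementation: the `PageStats` fields, the thresholds, and the choice to skip pages that already have an OCR layer are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements (not a pymupdf4llm type)."""
    text_chars: int              # characters in the extractable text layer
    illegible_chars: int         # characters that decode to junk, e.g. U+FFFD
    image_coverage: float        # fraction of the page area covered by images (0..1)
    vector_text_suspected: bool  # glyph shapes drawn as vector paths, not text
    has_ocr_layer: bool          # the page already carries an OCR text layer


def needs_ocr(stats, max_illegible_ratio=0.1, max_image_coverage=0.5):
    """Return True when a page looks like it would benefit from OCR.

    Both thresholds are illustrative guesses, not values from the library.
    """
    if stats.has_ocr_layer:
        return False  # assumption: trust an existing OCR layer rather than redo it
    if stats.vector_text_suspected:
        return True   # visible "text" that the text layer cannot see
    if stats.image_coverage > max_image_coverage:
        return True   # mostly image: likely a scan
    if stats.text_chars and stats.illegible_chars / stats.text_chars > max_illegible_ratio:
        return True   # text layer exists but is largely unreadable
    return False      # clean digital text: leave it alone


def pages_to_ocr(all_stats):
    """Indices of the pages that should be sent to the OCR engine."""
    return [i for i, stats in enumerate(all_stats) if needs_ocr(stats)]
```

The point of the design is that clean pages never reach the OCR engine, which is where both the time saving and the protection against recognition errors would come from.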

Key highlights

Three output formats from one import: to_markdown(), to_json(), to_text()
Layout-aware extraction: multi-column reconstruction, table-to-Markdown, header/footer stripping, font-size-based heading detection
Page chunking mode returns per-page dicts with metadata, bounding boxes, and TOC items—ready for vector stores
Drop-in integrations for LlamaIndex (LlamaMarkdownReader) and LangChain (PyMuPDFLoader, MarkdownTextSplitter)
Runs fully offline; no GPU, no tokens, no cloud bill
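The page chunking highlight is the one most directly useful for RAG, so here is a hedged sketch of turning per-page dicts into records for a vector store. It assumes `to_markdown(path, page_chunks=True)` returns a list of dicts with `"text"`, `"metadata"` (including `"file_path"` and `"page"`), and `"toc_items"` keys; verify the exact key names against the version you install. The helper names and the record shape are hypothetical.

```python
def chunks_to_records(chunks):
    """Flatten per-page chunk dicts into simple records for indexing.

    The expected dict keys are assumptions about the library's output;
    `.get` is used throughout so missing keys degrade gracefully.
    """
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip blank pages rather than indexing empty strings
        meta = chunk.get("metadata", {})
        page = meta.get("page", index + 1)
        records.append({
            "id": f"{meta.get('file_path', 'doc')}#p{page}",
            "text": text,
            # Assumption: TOC entries are [level, title, page] triples.
            "headings": [item[1] for item in chunk.get("toc_items", [])],
        })
    return records


def load_records(pdf_path):
    """Extract page chunks from a PDF and convert them to records."""
    import pymupdf4llm  # imported lazily so chunks_to_records works without it

    chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
    return chunks_to_records(chunks)
```

Keeping the reshaping step separate from the extraction call means the record format can be unit-tested on hand-written dicts, without a PDF or the package present.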

Caveats

The “10–250× cheaper than vision-based LLM extraction” claim is in the README but unsourced—treat as marketing unless you verify against your own pipeline costs
Office document support (Word, Excel, PowerPoint, HWP) requires the separate PyMuPDF Pro package, which is not open source
Legacy layout mode exists (use_layout(False)) with different header detection behavior; the README doesn’t clarify when you’d still need it

Verdict

Worth a look if you’re building document ingestion for RAG and want to stay local. Skip it if you need deep PDF editing or already have a mature extraction stack you’re happy with.
