66K stars for a PDF-to-markdown pipeline that actually reads the layout
MinerU turns messy PDFs, Office files, and images into structured Markdown so your RAG pipeline stops choking on scrambled text.

What it does
MinerU is a document parser that converts PDFs, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON. It emits formulas as LaTeX and tables as HTML, and it tries to reconstruct the original reading order while stripping page headers and footers. For scanned documents and handwriting, it runs OCR covering 109 languages.
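Because tables arrive as HTML and formulas as LaTeX inside the Markdown, a downstream RAG step can pull them out and chunk them separately from the prose. The following is a minimal sketch of that post-processing using only the Python standard library. It does not call MinerU itself: the names `TableRows` and `split_structured` are hypothetical, the sample string is hand-written, and the `$$ ... $$` delimiter for block formulas is an assumption about the output format, so check it against real output before relying on it.

```python
import re
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of one HTML table as a list of rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Non-greedy match: fine for flat tables, not for tables nested inside tables.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
# Assumption: block formulas are wrapped in $$ ... $$.
BLOCK_MATH_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def split_structured(markdown):
    """Return (tables, formulas) found in a Markdown string."""
    tables = []
    for match in TABLE_RE.finditer(markdown):
        parser = TableRows()
        parser.feed(match.group(0))
        parser.close()
        tables.append(parser.rows)
    formulas = [f.strip() for f in BLOCK_MATH_RE.findall(markdown)]
    return tables, formulas


# Hand-written sample standing in for parser output.
sample = """# Results

$$
E = mc^2
$$

<table><tr><th>Model</th><th>Score</th></tr><tr><td>A</td><td>0.91</td></tr></table>
"""

tables, formulas = split_structured(sample)
print(tables)    # [[['Model', 'Score'], ['A', '0.91']]]
print(formulas)  # ['E = mc^2']
```

Keeping tables as row lists rather than raw HTML makes it easier to serialize them per row for embedding, at the cost of losing merged-cell information such as `rowspan` and `colspan`.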
The interesting bit
The project ships three inference backends: a fast “pipeline” mode that runs on CPU or GPU, a VLM engine for higher accuracy, and a “hybrid-engine” that tries to split the difference. It also supports a long list of domestic Chinese AI chips (Ascend, Cambricon, Enflame, etc.) and offers an MCP server for use from Cursor and Claude Desktop, plus integrations with LangChain, Dify, and FastGPT. The recent license switch from AGPLv3 to a custom Apache-2.0-based license reads as an attempt to lower the friction of commercial adoption.
Key highlights
- Native parsing for DOCX, PPTX, and XLSX (not just PDF)
- VLM + OCR dual engine with 109-language support
- Three inference backends: pipeline, vlm-engine, hybrid-engine
- MCP server + SDKs in Python, Go, TypeScript, plus CLI, REST API, Docker
- Supports 10+ domestic Chinese AI chips for offline deployment
Caveats
- The README claims “state-of-the-art” accuracy for the new VLM model but gives no numbers to back it up
- The custom “MinerU Open Source License” is Apache-2.0-based, but the README does not summarize what it changes; you’ll need to read the full text
- Heavy emphasis on Chinese hardware ecosystems may mean rougher edges on standard NVIDIA/AMD setups
Verdict
Worth a look if you’re building RAG or agent pipelines and need more than raw text extraction from complex documents. Skip it if your PDFs are already clean and single-column, or if you can’t stomach parsing a custom license.