opendatalab/MinerU

66K stars for a PDF-to-markdown pipeline that actually reads the layout

MinerU turns messy PDFs, Office files, and images into structured markdown so your RAG pipeline stops choking on scrambled text.

66.8k stars · Python · Data Tooling
Velocity · 7d: +80 ★ / day · Trend: steady

What it does

MinerU is a document parser that converts PDFs, DOCX, PPTX, XLSX, images, and web pages into structured Markdown or JSON. It outputs formulas as LaTeX and tables as HTML, and attempts to reconstruct the original reading order while stripping headers and footers. It also runs OCR across 109 languages for scanned documents and handwriting.
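To make the output format concrete, here is a minimal sketch of downstream post-processing, assuming the generated Markdown embeds tables as inline `<table>` HTML and display formulas between `$$` delimiters. Those delimiters, the sample text, and the helper names `TableRows` and `split_parsed_markdown` are illustrative assumptions, not part of MinerU's API. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of one HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text buffer for the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def split_parsed_markdown(markdown):
    """Return (tables, formulas) found in a Markdown string.

    Assumes HTML tables and $$...$$ display math (hypothetical layout).
    """
    tables = []
    for html in re.findall(r"<table.*?</table>", markdown, flags=re.S | re.I):
        parser = TableRows()
        parser.feed(html)
        tables.append(parser.rows)
    formulas = [
        f.strip() for f in re.findall(r"\$\$(.+?)\$\$", markdown, flags=re.S)
    ]
    return tables, formulas


# Hand-written sample standing in for parser output (not real MinerU output).
sample = """# Results

$$
E = mc^2
$$

<table><tr><th>Model</th><th>Score</th></tr><tr><td>A</td><td>0.91</td></tr></table>
"""

tables, formulas = split_parsed_markdown(sample)
print(tables)
print(formulas)
```

Keeping tables and formulas as separate structures lets a RAG pipeline index them differently from running prose, for example by serializing each table row as its own chunk.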

The interesting bit

The project ships three inference backends: a fast CPU/GPU “pipeline” mode, a “vlm-engine” for higher accuracy, and a “hybrid-engine” that tries to split the difference. It also supports a laundry list of domestic Chinese AI chips (Ascend, Cambricon, Enflame, etc.) and offers an MCP server for clients such as Cursor and Claude Desktop, plus integrations with LangChain, Dify, and FastGPT. The recent license switch from AGPLv3 to a custom Apache-2.0-based license looks like a deliberate attempt to reduce friction for commercial adoption.
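The trade-off between the three backends can be sketched as a small lookup. This is a hypothetical helper, not MinerU code: the `Priority` enum and `choose_backend` function are invented for illustration, and the backend names are taken from this write-up rather than verified against the CLI's actual flag values.

```python
from enum import Enum


class Priority(Enum):
    """What the caller cares about most (hypothetical categories)."""

    SPEED = "speed"
    ACCURACY = "accuracy"
    BALANCED = "balanced"


# Backend names as described in this article; the real option strings
# accepted by the tool may differ, so check the project's documentation.
BACKEND_BY_PRIORITY = {
    Priority.SPEED: "pipeline",
    Priority.ACCURACY: "vlm-engine",
    Priority.BALANCED: "hybrid-engine",
}


def choose_backend(priority):
    """Map a priority to the backend the article associates with it."""
    return BACKEND_BY_PRIORITY[priority]


print(choose_backend(Priority.SPEED))
```

In practice the choice would also depend on available hardware and document complexity, which this sketch deliberately leaves out.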

Key highlights

  • Native parsing for DOCX, PPTX, and XLSX (not just PDF)
  • VLM + OCR dual engine with 109-language support
  • Three inference backends: pipeline, vlm-engine, hybrid-engine
  • MCP server + SDKs in Python, Go, TypeScript, plus CLI, REST API, Docker
  • Supports 10+ domestic AI chips for offline deployment

Caveats

  • The “state-of-the-art” accuracy claim for the new VLM model is stated but not quantified in the README
  • The custom “MinerU Open Source License” is Apache-2.0-based, but the exact modifications are not summarized; you’ll need to read the full text
  • Heavy emphasis on Chinese hardware ecosystems may mean rougher edges on standard NVIDIA/AMD setups

Verdict

Worth a look if you’re building RAG or agent pipelines and need more than raw text extraction from complex documents. Skip it if your PDFs are already clean and single-column, or if you can’t stomach parsing a custom license.
