Is MinerU open source?

Yes. opendatalab/MinerU is an open-source project tracked on heatdrop. It ships under a custom Apache 2.0-based license rather than standard Apache 2.0.

What language is MinerU written in?

opendatalab/MinerU is primarily written in Python.

How popular is MinerU?

opendatalab/MinerU has 75.5k stars on GitHub. It gained about 106 stars per day over the last 7 days, and that growth trend is currently cooling.

Where can I find MinerU?

opendatalab/MinerU is on GitHub at https://github.com/opendatalab/MinerU.


opendatalab/MinerU

The document excavator that feeds your RAG pipeline

MinerU turns PDFs, Office files, and images into structured Markdown and JSON so LLM agents don’t drown in layout noise.

★ 75.5k stars · Python · Data Tooling


Velocity (7d): +106 ★ / day · Trend: ↘ cooling

What it does

MinerU is a document parsing engine that ingests PDFs, Word docs, PowerPoints, Excel sheets, images, and web pages, then emits structured Markdown or JSON tuned for LLM consumption. It handles the gnarly cases (multi-column layouts, scanned pages, handwriting, cross-page tables), renders formulas as LaTeX, and strips headers and footers, so downstream RAG and agent workflows get clean context instead of layout soup. The project ships with three inference backends: a fast pipeline that runs on CPU or GPU, a high-accuracy VLM engine, and a hybrid mode that aims to keep hallucinations low.
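To make the "clean context" point concrete, here is a minimal sketch of what a downstream RAG step might do with Markdown output: split it into chunks keyed by their heading path. It does not call MinerU at all. The `Chunk` class, the `chunk_markdown` function, and the sample text are hypothetical names and data made up for illustration, and real MinerU output may be structured differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # enclosing headings, outermost first
    text: str                # body text found under that heading


# ATX-style Markdown heading: one to six '#' characters, then a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split Markdown into chunks, one per heading section with body text."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        # Track fenced code so a '# comment' inside code is not a heading.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical sample, standing in for a parsed document.
sample = """# Annual Report

Intro paragraph.

## Revenue

| Year | Total |
|------|-------|
| 2023 | 10 |

## Outlook

Formula: $E = mc^2$
"""

chunks = chunk_markdown(sample)
for chunk in chunks:
    print(" > ".join(chunk.heading_path), "|", len(chunk.text), "chars")
```

Keeping the heading path with each chunk lets a retriever return a table or formula together with the section it came from, which is only possible when the parser has already removed layout noise such as running headers and footers.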

The interesting bit

The real signal is the dual VLM+OCR engine and the recent license shift from AGPLv3 to a custom Apache 2.0–based license, which lowers the barrier for commercial use. It also integrates as an MCP server for Cursor and Claude Desktop, and supports a laundry list of domestic Chinese AI chips, suggesting it was built for serious offline enterprise deployment, not just weekend hacking.

Key highlights

Native parsing for DOCX, PPTX, and XLSX alongside PDFs and images; outputs Markdown, JSON, and HTML tables.
VLM + OCR dual engine covering 109 languages, with recent upgrades for chart parsing and cross-page table merging.
Runs fully offline with three backend options: fast pipeline, VLM-driven accuracy, or a hybrid low-hallucination mode.
Plugs directly into LangChain, LlamaIndex, Dify, and FastGPT, plus an MCP server for AI coding tools.
Supports 10+ domestic AI accelerators (Ascend, Cambricon, MetaX, etc.) in addition to standard NVIDIA/CPU stacks.

Caveats

The custom “MinerU Open Source License” is based on Apache 2.0 but is not the standard OSI-approved Apache 2.0 license, so legal review is warranted before embedding it in a commercial product.
The README is heavy on feature lists and light on quantitative accuracy benchmarks or latency numbers.

Verdict

Worth evaluating if you’re building RAG or agentic workflows that ingest messy real-world documents, especially in regulated or air-gapped environments. Skip it if your documents are already clean text or you prefer to outsource parsing to a managed API.
