Is MegaParse open source?

Yes — QuivrHQ/MegaParse is open source, released under the Apache-2.0 license.

What language is MegaParse written in?

QuivrHQ/MegaParse is primarily written in Python.

How popular is MegaParse?

QuivrHQ/MegaParse has 7.4k stars on GitHub.

Where can I find MegaParse?

QuivrHQ/MegaParse is on GitHub at https://github.com/QuivrHQ/MegaParse.

QuivrHQ/MegaParse

Parsing documents without the usual carnage

A document parser that actually tries to keep your tables, headers, and images intact before feeding them to an LLM.

★ 7.4k stars · Python · Data Tooling

What it does

MegaParse extracts content from PDFs, Word docs, PowerPoints, and Excel/CSV files, returning structured output meant for LLM ingestion. It handles tables, TOCs, headers, footers, and images rather than flattening everything into a text soup. There’s a standard mode and a “Vision” mode that routes documents through multimodal models (GPT-4o, Claude 3.5/4) for parsing.

The interesting bit

The project ships with a benchmark comparing similarity ratios against other parsers, and its vision-based approach scores 0.87 versus 0.33 for llama_parser and 0.59 for unstructured. That’s a meaningful gap if you actually need your document structure to survive. The modular “checker” postprocessing pipeline is still being built out, but the direction is toward pluggable validation rather than one-shot extraction.
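To make the "similarity ratio" figure more concrete, here is a minimal sketch of how such a score can be computed with Python's standard library. It assumes a difflib-style ratio between parser output and a hand-written reference; whether MegaParse's benchmark computes its score exactly this way is an assumption, and the sample strings and the `similarity_ratio` helper are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_ratio(parsed: str, reference: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, parsed, reference).ratio()


# Hypothetical reference: a small table written out as pipe-delimited text.
reference = "| Name | Qty |\n| Bolt | 4 |"

# Hypothetical parser outputs: one keeps the table layout, one flattens it.
structured = "| Name | Qty |\n| Bolt | 4 |"
flattened = "Name Qty Bolt 4"

print("structured:", similarity_ratio(structured, reference))
print("flattened: ", similarity_ratio(flattened, reference))
```

A parser that drops table delimiters and line breaks loses points even when every word survives, which is why a structure-preserving parser can score well above one that flattens the page.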

Key highlights

Supports PDF, Word, PowerPoint, Excel, CSV, and plain text
Preserves tables, images, headers, footers, and table of contents
Vision mode uses multimodal LLMs (GPT-4o, Claude 3.5/4) for higher-fidelity extraction
Includes FastAPI server mode via make dev
Benchmark suite is extensible; PRs welcome for new parser configs
Requires Python ≥3.11, plus poppler and tesseract; macOS also needs libmagic

Caveats

Vision mode requires OpenAI or Anthropic API keys; not self-contained
Several system dependencies (poppler, tesseract) needed before pip install gets you anywhere
The project’s “In Construction” section notes that table checker improvements and structured output are unfinished
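Since the prerequisites above are easy to miss, a small preflight script can report what is absent before installation. This is a sketch under stated assumptions: it looks for `pdftoppm` (a binary that ships with poppler) and `tesseract` on the PATH, and for the `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables conventionally read by those vendors' SDKs. The `preflight` function is hypothetical and not part of MegaParse.

```python
import os
import shutil
import sys

# Assumed binary names: pdftoppm ships with poppler, tesseract with Tesseract OCR.
REQUIRED_BINARIES = {"poppler": "pdftoppm", "tesseract": "tesseract"}

# Conventional variable names for OpenAI and Anthropic keys (an assumption here).
VISION_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def preflight(which=shutil.which, env=None, version=None):
    """Return a list of human-readable problems; an empty list means all clear."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []

    # The page lists Python >= 3.11 as a requirement.
    if version < (3, 11):
        problems.append(f"Python {version[0]}.{version[1]} found; 3.11 or newer required")

    # System dependencies must be installed before pip install is useful.
    for name, binary in REQUIRED_BINARIES.items():
        if which(binary) is None:
            problems.append(f"{name} not found on PATH (looked for '{binary}')")

    # Vision mode needs at least one provider key; standard mode does not.
    if not any(env.get(key) for key in VISION_KEYS):
        problems.append("no OpenAI or Anthropic API key set; Vision mode unavailable")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("-", problem)
```

The lookup function, environment, and version are injectable so the check can be exercised without touching the real system. The libmagic requirement on macOS is a library rather than a command-line binary, so it is left out of this check.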

Verdict

Worth a look if you’re building RAG pipelines and tired of watching your document structure get mangled. Skip it if you need a fully offline, zero-dependency parser or if you’re not ready to feed documents to third-party multimodal APIs.
