QuivrHQ/MegaParse

Parsing documents without the usual carnage

A document parser that actually tries to keep your tables, headers, and images intact before feeding them to an LLM.

7.4k stars · Python · Data Tooling
Velocity (7d): +10.0 ★ / day · Trend: steady

What it does

MegaParse extracts content from PDFs, Word docs, PowerPoints, and Excel/CSV files, returning structured output meant for LLM ingestion. It handles tables, TOCs, headers, footers, and images rather than flattening everything into a text soup. There’s a standard mode and a “Vision” mode that routes documents through multimodal models (GPT-4o, Claude 3.5/4) for parsing.

The interesting bit

The project ships with a benchmark that compares similarity ratios against other parsers. Its vision-based parser scores 0.87, versus 0.59 for unstructured and 0.33 for llama_parser. Those are the project's own numbers, but that's a meaningful gap if you actually need your document structure to survive. The modular “checker” postprocessing pipeline is still being built out, but the direction is toward pluggable validation rather than one-shot extraction.
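To make the "similarity ratio" idea concrete, here is a minimal sketch of how such a score could be computed. It assumes the metric behaves like the ratio from Python's standard-library difflib.SequenceMatcher; the listing doesn't say which metric the benchmark actually uses. The helper names and the sample strings are hypothetical and are not taken from MegaParse.

```python
from difflib import SequenceMatcher


def similarity_ratio(reference: str, parsed: str) -> float:
    """Return a score in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, reference, parsed).ratio()


def rank_parsers(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each parser's output against the reference, best first."""
    scores = {name: similarity_ratio(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: a hand-written reference and two made-up parser outputs.
reference = "| Part | Qty |\n|------|-----|\n| Bolt | 4 |\n"
outputs = {
    "structure_kept": "| Part | Qty |\n|------|-----|\n| Bolt | 4 |\n",
    "flattened": "Part Qty Bolt 4",
}

for name, score in rank_parsers(reference, outputs):
    print(f"{name}: {score:.2f}")
```

A parser that reproduces the reference exactly scores 1.0, and one that flattens the table scores lower. A character-level ratio like this rewards preserved structure only indirectly, so treat it as a rough proxy rather than a measure of layout fidelity.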

Key highlights

  • Supports PDF, Word, PowerPoint, Excel, CSV, and plain text
  • Preserves tables, images, headers, footers, and table of contents
  • Vision mode uses multimodal LLMs (GPT-4o, Claude 3.5/4) for higher-fidelity extraction
  • Includes a FastAPI server mode, started with make dev
  • Benchmark suite is extensible; PRs welcome for new parser configs
  • Requires Python ≥3.11, plus poppler and tesseract; macOS also needs libmagic
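Since the system dependencies have to be in place before installation is useful, a quick preflight check can save a failed first run. The sketch below uses only the standard library. It assumes poppler is detectable through its pdftoppm binary and tesseract through a tesseract binary on PATH. It does not check libmagic, which is a shared library rather than a command. The function name is hypothetical and is not part of MegaParse.

```python
import shutil
import sys

# Assumed binary names: pdftoppm ships with poppler's command-line tools.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def missing_prerequisites(min_python=(3, 11), which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    problems = []
    if version < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for package, binary in REQUIRED_TOOLS.items():
        # shutil.which returns None when the binary is not on PATH.
        if which(binary) is None:
            problems.append(f"{package} not found (looked for '{binary}' on PATH)")
    return problems


if __name__ == "__main__":
    issues = missing_prerequisites()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```

The which and version parameters exist only so the check can be exercised without touching the real environment.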

Caveats

  • Vision mode requires OpenAI or Anthropic API keys; not self-contained
  • Several system dependencies (poppler, tesseract) needed before pip install gets you anywhere
  • The project's “In Construction” section notes that table checker improvements and structured output are unfinished

Verdict

Worth a look if you’re building RAG pipelines and tired of watching your document structure get mangled. Skip it if you need a fully offline, zero-dependency parser or if you’re not ready to feed documents to third-party multimodal APIs.
