adithya-s-k/omniparse

A local Swiss Army knife for turning files into LLM-ready markdown

OmniParse bundles OCR, transcription, and web crawling into one self-hosted server so your RAG pipeline doesn't depend on a dozen SaaS subscriptions.

7.5k stars · Python · Data Tooling · RAG · Search
Star history: velocity (7d) +10 ★/day, trend steady

What it does

OmniParse is a self-hosted ingestion server that chews through documents, images, audio, video, and web pages, then spits out structured markdown. It wraps existing open-source models and tools (Marker/Surya for PDFs, Florence-2 for image tasks, Whisper for audio, and Selenium for crawling) behind a single HTTP API. The pitch is simple: feed it a file, get back something clean enough to drop straight into a vector database.
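
To make the "single HTTP API" idea concrete, here is a minimal client sketch using only the Python standard library. It is not taken from the project: the base URL, the route names in `ROUTES`, the `file` form field, and the JSON response are all assumptions to verify against the server's own documentation, and `KINDS`, `endpoint_for`, `build_multipart`, and `parse_file` are hypothetical helpers.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from urllib import request

# Assumption: a server is listening locally on this address.
BASE_URL = "http://localhost:8000"

# Assumption: route names; confirm them against the running server.
ROUTES = {
    "document": "/parse_document",
    "image": "/parse_media/image",
    "audio": "/parse_media/audio",
    "video": "/parse_media/video",
}

# Illustrative, deliberately incomplete extension map (hypothetical).
KINDS = {
    ".pdf": "document", ".docx": "document", ".pptx": "document",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mkv": "video",
}


def endpoint_for(path: str) -> str:
    """Pick the parser URL for a file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in KINDS:
        raise ValueError(f"no parser mapped for {suffix!r}")
    return BASE_URL + ROUTES[KINDS[suffix]]


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content type)."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def parse_file(path: str) -> dict:
    """POST a local file to the matching endpoint (assumes a JSON reply)."""
    p = Path(path)
    body, content_type = build_multipart("file", p.name, p.read_bytes())
    req = request.Request(
        endpoint_for(path),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `parse_file("report.pdf")` would upload the file and return whatever JSON the server sends back; the shape of that payload is not specified here, so inspect it before wiring it into an indexing step.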

The interesting bit

The project fits all of this on a single T4 GPU, needing roughly 8–10 GB of VRAM, by deliberately using the smallest model variants. That’s a pragmatic trade-off: it sacrifices peak accuracy for the ability to run entirely offline without API keys or egress costs. The roadmap is more ambitious still: eventually replacing the whole model zoo with one multimodal parser.

Key highlights

  • Supports ~20 file types across documents, media, and dynamic web pages
  • Runs entirely locally; no external API calls
  • Docker and SkyPilot deployment options, plus a Gradio UI
  • Modular server startup: load only the document, media, or web parsers you need
  • Outputs structured markdown with table extraction, image captioning, and transcription
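
The modular startup mentioned above can be pictured as a set of per-parser switches on the server command. The sketch below only assembles such a command line. The `server.py` entry point and the `--host`, `--port`, `--documents`, `--media`, and `--web` flags are recalled from the project's README and should be treated as assumptions to check against the current release; `server_command` is a hypothetical helper.

```python
import shlex

# Assumption: flag names as recalled from the project's README.
PARSER_FLAGS = {"documents": "--documents", "media": "--media", "web": "--web"}


def server_command(parsers, host="0.0.0.0", port=8000):
    """Build an argv list that enables only the requested parser groups."""
    requested = set(parsers)
    if not requested:
        raise ValueError("select at least one parser group")
    unknown = requested - set(PARSER_FLAGS)
    if unknown:
        raise ValueError(f"unknown parser groups: {sorted(unknown)}")
    cmd = ["python", "server.py", "--host", host, "--port", str(port)]
    # Emit flags in a fixed order regardless of the order they were requested in.
    cmd += [flag for name, flag in PARSER_FLAGS.items() if name in requested]
    return cmd


print(shlex.join(server_command(["documents"])))
# prints: python server.py --host 0.0.0.0 --port 8000 --documents
```

Loading only the document parsers, as in the example, is the kind of choice that keeps VRAM use down when media and web parsing are not needed.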

Caveats

  • Server is Linux-only; Windows and macOS are explicitly unsupported
  • Weights for the underlying Marker/Surya models are licensed cc-by-nc-sa-4.0, which restricts commercial use (waived only for small orgs under $5M in both revenue and funding)
  • Document parsing has known rough edges: equations don’t always convert to LaTeX, tables can misalign, and non-English text (e.g., Chinese) may parse poorly
  • Smallest model variants mean “best-in-class performance” is explicitly not the goal

Verdict

Worth a look if you’re building RAG pipelines and tired of stitching together five different services, but only if you’ve got the GPU and the Linux box to host it. Teams needing production-grade OCR accuracy or Windows deployment should probably wait or look elsewhere.
