morphik-org/morphik-core

RAG that actually reads the diagrams

Morphik is an end-to-end document store and search system built for visually rich, multimodal documents—because your PDFs contain more than text.

3.6k stars · Python · RAG · Search · Data Tooling
Velocity (7d): +6.3 ★/day · Trend: steady

What it does Morphik ingests, stores, and searches unstructured documents (PDFs, images, videos) through a single Python SDK or REST API. It handles the full pipeline from ingestion to query, including metadata extraction and integrations with Google Suite, Slack, and Confluence. You can self-host via Docker or use its managed cloud tier.

The interesting bit The project leans on ColPali for multimodal search, meaning it indexes visual content directly rather than flattening diagrams and charts into broken text fragments. The README is refreshingly blunt about why traditional RAG fails: “Charts become meaningless text fragments. Critical diagrams lose their spatial relationships.” Morphik treats the visual layer as first-class data, not an afterthought.
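To make "indexes visual content directly" concrete: ColPali-style retrieval is generally described as late interaction, where each page image is stored as many patch embeddings and a query is scored against them token by token. The sketch below illustrates that scoring idea only. It is not Morphik's code, the function names are hypothetical, and the toy 2-D vectors stand in for embeddings a real vision-language model would produce.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def maxsim_score(query_vecs, patch_vecs):
    """Late-interaction score (hypothetical helper): for each query-token
    vector, take its best cosine similarity over all page-patch vectors,
    then sum those per-token maxima."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in patch_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


def rank_pages(query_vecs, pages):
    """Return (page_name, score) pairs, best match first."""
    scored = [(name, maxsim_score(query_vecs, patches))
              for name, patches in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (assumed for illustration): two query-token vectors and two
# pages, each represented by a handful of patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "wiring_diagram": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "cover_page": [[1.0, 1.0], [-1.0, 0.0]],
}

ranking = rank_pages(query, pages)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because every patch keeps its own vector, a query term can match one region of a diagram without the page first being flattened into text, which is the property the README quote is pointing at.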

Key highlights

  • Multimodal search via ColPali across images, PDFs, and video through one endpoint
  • Rules-based metadata extraction with bounding boxes and classification
  • MCP (Model Context Protocol) support for LLM tool integration
  • Business Source License 1.1: free for personal/indie use, free for commercial use below $2K/month in revenue, and each release converts to Apache 2.0 after four years
  • Self-hosted deployments available but explicitly “not fully supported” by the team

Caveats

  • The project pushes hard toward the managed cloud service; self-hosters get installation guides and Discord, but the README warns of limited support resources
  • A June 2025 auth migration is required for existing self-hosted installs to avoid performance degradation

Verdict Worth evaluating if you’re building RAG over technical documentation, manuals, or any visually dense corpus where diagram understanding matters. Skip it if you need a fully community-supported, pure-open-source vector database without commercial nudges.
