← all repositories
Ontos-AI/knowhere

Document parsing that remembers where it put things

A pipeline that turns messy PDFs and slides into structured, navigable memory for AI agents instead of flat text shards.

1.1k stars Python RAG · SearchData Tooling
knowhere
Velocity · 7d
+27
★ / day
Trend
steady
star history

What it does

Knowhere ingests unstructured documents—PDFs, Office files, images, tables, Markdown—and outputs structured chunks with full hierarchy intact. It stores them as a navigable memory graph that agents can query, traverse, and cite. The project is the backend API and worker; a separate dashboard repo provides the web UI.

The interesting bit

Most chunking flattens documents into semantic confetti. Knowhere uses a tree-like algorithm to reconstruct document hierarchy, then links chunks across documents into a lightweight graph. Agents don’t just retrieve nearest-neighbor vectors; they walk section trees and cross-document relationships the way a human skims a table of contents.

Key highlights

  • Multi-modal parsing with VLM summaries for images and tables, linked back to source chunks
  • Hybrid retrieval fusing keyword, path, content, and semantic signals, plus agentic navigation
  • Model-agnostic LLM/VLM support: defaults to DeepSeek and Qwen-VL, swappable via env vars
  • Uses MinerU as default parser but explicitly decouples from it—any Markdown-outputting tool works
  • Apache 2.0, with Docker Compose self-hosting option and managed cloud API at knowhereto.ai

Caveats

  • Requires a small constellation of external services: PostgreSQL, Redis, S3-compatible storage, and at least one LLM API key to do much of anything
  • Benchmarks shown are internal evaluations with limited public detail; the README notes they’re “actively expanding”
  • Several formats (EPUB, HTML, audio, video, and a curious .skills.md) are listed as “coming soon”

Verdict

Worth a look if you’re building agentic RAG and tired of watching your LLM lose the plot across flattened document chunks. Skip it if you need a quick, zero-dependency document parser—this is memory infrastructure, not a drop-in replacement for MinerU or Markitdown.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.