VectifyAI/PageIndex

RAG without the vector tax: PageIndex makes LLMs browse documents like humans

A tree-structured index replaces chunking and similarity search with reasoning-based navigation for long professional documents.

32.7k stars Python RAG · SearchAgents
Velocity (7d): +76 ★ / day · Trend: steady

What it does

PageIndex builds a hierarchical “table of contents” tree from PDFs or Markdown, then lets an LLM reason its way through that tree to find relevant sections. No vector database, no fixed-size chunking, no embedding search. The open-source repo handles basic PDF parsing; a cloud service with enhanced OCR and retrieval is also available via API or MCP.
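To make the idea of reasoning-based navigation concrete, here is a minimal sketch of a table-of-contents tree and a top-down descent through it. This is not PageIndex's API: `Node`, `keyword_chooser`, and `navigate` are hypothetical names, and the word-overlap chooser is only a stand-in for the LLM call that would decide which section to open next.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One entry in a table-of-contents tree (hypothetical structure)."""
    title: str
    summary: str = ""
    children: List["Node"] = field(default_factory=list)


# A chooser receives the question and the candidate sections,
# and returns the index of the section to descend into.
Chooser = Callable[[str, List[Node]], int]


def keyword_chooser(question: str, candidates: List[Node]) -> int:
    """Stand-in for an LLM decision: pick the section sharing the most words."""
    asked = set(question.lower().split())
    scores = [
        len(asked & set(f"{c.title} {c.summary}".lower().split()))
        for c in candidates
    ]
    return scores.index(max(scores))


def navigate(root: Node, question: str, choose: Chooser = keyword_chooser) -> List[Node]:
    """Walk from the root to a leaf, recording every section visited."""
    path = [root]
    while path[-1].children:
        children = path[-1].children
        path.append(children[choose(question, children)])
    return path


report = Node("Annual Report", children=[
    Node("Business Overview", "products and markets"),
    Node("Financial Statements", "balance sheet income and cash flow", children=[
        Node("Balance Sheet", "assets and liabilities"),
        Node("Cash Flow Statement", "operating investing financing cash flow"),
    ]),
])

path = navigate(report, "what was the operating cash flow")
print(" > ".join(n.title for n in path))
# prints "Annual Report > Financial Statements > Cash Flow Statement"
```

The returned path is what makes the search traceable: every hop is a named section, so a wrong answer can be traced back to the step where the descent went astray. Swapping `keyword_chooser` for a function that prompts a model with the candidate titles and summaries would keep the rest of the loop unchanged.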

The interesting bit

The project explicitly rejects “vibe retrieval” (their term for opaque vector similarity) in favor of traceable, explainable tree search. They claim 98.7% accuracy on FinanceBench, a financial-document QA benchmark, though the README doesn’t detail how that number was measured or which baselines it was compared against. The project's AlphaGo analogy is a stretch, but the core idea is sound: humans navigate documents by structure and inference, not by comparing cosine similarities between embeddings.

Key highlights

  • Self-hosted Python pipeline: run_pageindex.py generates trees from PDFs or Markdown with configurable node sizes and LLM backends (via LiteLLM)
  • Agentic RAG example using OpenAI Agents SDK for end-to-end reasoning workflows
  • Vision-based RAG notebook works directly on page images, skipping OCR entirely
  • Cloud service adds enhanced OCR, better tree building, and a chat platform; enterprise on-prem available
  • Markdown mode included, though the README warns against using it on PDF-converted Markdown (hierarchy gets mangled)
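The Markdown caveat in the last bullet is easier to see with a small sketch. The code below is not PageIndex's parser; `markdown_to_tree` and `depth` are hypothetical helpers that nest sections by heading level. It also assumes one plausible way a PDF-to-Markdown converter might mangle hierarchy, namely emitting every heading at the same level.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def markdown_to_tree(text: str) -> dict:
    """Nest ATX headings into a tree using their '#' count as depth."""
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # chain of currently open ancestors
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code blocks
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"title": match.group(2), "level": level, "children": []}
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def depth(node: dict) -> int:
    """Number of heading levels below this node."""
    if not node["children"]:
        return 0
    return 1 + max(depth(child) for child in node["children"])


clean = "# Report\n## Risks\n### Market\n## Financials\n"
flattened = "## Report\n## Risks\n## Market\n## Financials\n"  # assumed converter output

clean_tree = markdown_to_tree(clean)
flat_tree = markdown_to_tree(flattened)
print(depth(clean_tree), depth(flat_tree))  # prints "3 1"
```

With the hierarchy intact the tree has three levels to reason over; with flattened headings it degenerates into a flat list of four siblings, which leaves a tree search nothing to navigate.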

Caveats

  • The open-source version uses “standard PDF parsing”; the README itself nudges you toward the cloud service for “complex PDFs”
  • The 98.7% FinanceBench claim is sourced to another repo and a blog post, not reproduced in this repo’s code or docs
  • Tree generation is LLM-dependent and likely slower than vector lookup; cost and latency tradeoffs are unexplored in the README

Verdict

Worth a look if you’re hitting accuracy walls with vector RAG on long, structured documents like financial reports or legal filings. Skip it if your documents are short or unstructured, or if your latency budget is tight: this trades speed for reasoning depth, and the reasoning isn’t free.
