infiniflow/ragflow

RAGFlow: 81K stars, one chunky document pipeline

An open-source RAG engine that treats document parsing as a first-class problem, not an afterthought.

82.1k stars · Python · RAG · Search · Agents
Velocity (7d): +90 ★/day · Trend: steady

What it does

RAGFlow is a self-hostable retrieval-augmented generation platform that ingests messy documents (PDFs, scans, slides, spreadsheets, web pages) and turns them into grounded, citable answers via LLMs. It bundles chunking, embedding, re-ranking, and now agentic workflows into one Docker-deployable system. The project also runs a managed cloud at cloud.ragflow.io.

The interesting bit

Most RAG tools assume your documents are already clean text. RAGFlow leans hard into the opposite assumption: it ships “DeepDoc” parsing, template-based chunking you can inspect and tweak, and visual tracing so you can see exactly which chunk spawned which sentence of the answer. The agentic layer (workflows, memory, MCP support, even a sandboxed Python/JS code executor) turns it from a search pipe into something closer to a context-aware automation engine.
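To make "which chunk spawned which sentence" concrete, here is a toy sketch of chunk-level tracing. It is not RAGFlow's implementation: the function names, the chunk id format, and the word-overlap scoring are all assumptions chosen for brevity. A real system would more likely match on embeddings or on citations emitted by the model.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for a real similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def trace_citations(answer, chunks):
    """Pair each answer sentence with the id of the chunk sharing the most words.

    `chunks` maps a hypothetical chunk id to its text. A sentence that shares
    no words with any chunk is paired with None, i.e. it is ungrounded.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    traced = []
    for sentence in sentences:
        words = tokens(sentence)
        best_id, best_overlap = None, 0
        for chunk_id, chunk_text in chunks.items():
            overlap = len(words & tokens(chunk_text))
            if overlap > best_overlap:
                best_id, best_overlap = chunk_id, overlap
        traced.append((sentence, best_id))
    return traced


# Made-up chunks and answer, for illustration only.
chunks = {
    "manual.pdf#p3-c1": "The pump must be primed before first use.",
    "manual.pdf#p7-c2": "Replace the filter every six months.",
}
answer = "Prime the pump before first use. Replace the filter every six months."

for sentence, chunk_id in trace_citations(answer, chunks):
    print(f"{chunk_id}: {sentence}")
```

The point of the sketch is the data shape: once every chunk has a stable id, any answer sentence can be shown next to its source, and sentences that map to None stand out as unsupported.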

Key highlights

  • Deep document parsing for “complicated formats” including scanned copies and images within PDFs/DOCX files
  • Visual chunking with human-in-the-loop intervention and traceable citations
  • Pre-built agent templates plus orchestrable ingestion pipelines
  • Broad data source sync: Confluence, S3, Notion, Discord, Google Drive
  • Configurable LLM and embedding model backends; GPU acceleration optional for parsing
  • Apache 2.0 licensed, Python 3.13+, Docker-based deployment

Caveats

  • Docker images are x86-only; ARM64 requires building your own image
  • Minimum specs are non-trivial: 4 cores, 16 GB RAM, 50 GB disk, plus vm.max_map_count >= 262144
  • gVisor required if you want the sandboxed code executor feature
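
The minimum specs above can be checked before installing. The following is a minimal preflight sketch, assuming a Linux host; `check_host` and `read_host` are hypothetical helpers, not part of RAGFlow, and the thresholds are the ones quoted in the caveats.

```python
import os
import shutil
from pathlib import Path

# Minimums quoted in the caveats above.
MIN_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 50
MIN_MAX_MAP_COUNT = 262144


def check_host(cores, ram_gb, disk_gb, max_map_count):
    """Return a list of problems; an empty list means the host passes."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU cores: {cores} < {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM (GB): {ram_gb} < {MIN_RAM_GB}")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Free disk (GB): {disk_gb} < {MIN_DISK_GB}")
    if max_map_count < MIN_MAX_MAP_COUNT:
        problems.append(f"vm.max_map_count: {max_map_count} < {MIN_MAX_MAP_COUNT}")
    return problems


def read_host(path="/"):
    """Collect the four values on Linux (the /proc path is Linux-specific)."""
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # The kernel reports slightly less than the nominal size, so round to
    # the nearest GB to avoid failing a machine sold as "16 GB".
    ram_gb = round(ram_bytes / 1024**3)
    disk_gb = shutil.disk_usage(path).free // 1024**3
    max_map_count = int(Path("/proc/sys/vm/max_map_count").read_text())
    return cores, ram_gb, disk_gb, max_map_count
```

On a Linux machine, `check_host(*read_host())` returns an empty list when all four minimums are met. If only `vm.max_map_count` fails, it can be raised with `sudo sysctl -w vm.max_map_count=262144`.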

Verdict

Worth a look if you’re building production RAG where document quality is the bottleneck, or if you need agentic workflows with auditability. Skip it if you want a lightweight drop-in vector search layer: this is a full-stack appliance, not a library.
