RAGFlow: 81K stars, one chunky document pipeline
An open-source RAG engine that treats document parsing as a first-class problem, not an afterthought.

What it does
RAGFlow is a self-hostable retrieval-augmented generation platform that ingests messy documents (PDFs, scans, slides, spreadsheets, web pages) and turns them into grounded, citable answers via LLMs. It bundles chunking, embedding, re-ranking, and now agentic workflows into one Docker-deployable system. The project also runs a managed cloud at cloud.ragflow.io.
The interesting bit
Most RAG tools assume your documents are already clean text. RAGFlow leans hard into the opposite assumption: it ships “DeepDoc” parsing, template-based chunking you can inspect and tweak, and visual tracing so you can see exactly which chunk spawned which sentence of the answer. The agentic layer (workflows, memory, MCP support, even a sandboxed Python/JS code executor) turns it from a search pipe into something closer to a context-aware automation engine.
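To make the "which chunk spawned which sentence" idea concrete, here is a toy sketch of chunk-level traceability. It is not RAGFlow's API and has nothing to do with how DeepDoc works internally: the `Chunk` class, the paragraph splitter, the word-overlap scoring, and the `source#id` citation format are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A piece of a document that remembers where it came from (hypothetical type)."""
    chunk_id: int
    source: str
    text: str


def chunk_by_paragraph(source, text):
    # Split on blank lines and give every paragraph a stable id.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [Chunk(i, source, p) for i, p in enumerate(paragraphs)]


def tokens(text):
    # Lowercased alphanumeric words; a crude stand-in for real embeddings.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=1):
    # Rank chunks by how many words they share with the question.
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c.text)), reverse=True)
    return ranked[:top_k]


def cite(chunk):
    # Attach the provenance marker a reader could follow back to the chunk.
    return f"{chunk.text} [{chunk.source}#{chunk.chunk_id}]"


if __name__ == "__main__":
    doc = "Invoices are due in 30 days.\n\nRefunds take 5 business days."
    chunks = chunk_by_paragraph("policy.txt", doc)
    best = retrieve("How long do refunds take?", chunks)[0]
    print(cite(best))  # Refunds take 5 business days. [policy.txt#1]
```

The point of the sketch is only that every retrieved passage carries its own id, so an answer can be audited back to its source. A real system would swap the word-overlap scoring for embeddings and re-ranking.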
Key highlights
- Deep document parsing for “complicated formats” including scanned copies and images within PDFs/DOCX files
- Visual chunking with human-in-the-loop intervention and traceable citations
- Pre-built agent templates plus orchestrable ingestion pipelines
- Broad data source sync: Confluence, S3, Notion, Discord, Google Drive
- Configurable LLM and embedding model backends; GPU acceleration optional for parsing
- Apache 2.0 licensed, Python 3.13+, Docker-based deployment
Caveats
- Docker images are x86-only; ARM64 requires building your own image
- Minimum specs are non-trivial: 4 cores, 16 GB RAM, 50 GB disk, plus vm.max_map_count >= 262144
- gVisor required if you want the sandboxed code executor feature
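If you want to check a host against those minimums before pulling images, a small preflight script is one way to do it. This is a sketch, not something the project ships: the function names are hypothetical, the thresholds are copied from the list above, and it assumes Linux, decimal gigabytes, and that "50 GB disk" means free space.

```python
import os
import shutil
from pathlib import Path

# Thresholds copied from the caveats above.
MIN_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 50
MIN_MAP_COUNT = 262144


def check_specs(cores, ram_gb, disk_gb, max_map_count):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU cores: {cores} < {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM (GB): {ram_gb:.1f} < {MIN_RAM_GB}")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk (GB): {disk_gb:.1f} < {MIN_DISK_GB}")
    if max_map_count < MIN_MAP_COUNT:
        problems.append(f"vm.max_map_count: {max_map_count} < {MIN_MAP_COUNT}")
    return problems


def read_host_specs(path="/"):
    """Read the four values from a Linux host (assumes /proc is available)."""
    cores = os.cpu_count() or 0
    # Total physical memory in decimal GB (assumption: 1 GB = 10**9 bytes).
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    # Assumption: the disk requirement refers to free space on `path`.
    disk_gb = shutil.disk_usage(path).free / 1e9
    max_map_count = int(Path("/proc/sys/vm/max_map_count").read_text())
    return cores, ram_gb, disk_gb, max_map_count
```

On a Linux box, `check_specs(*read_host_specs())` returns the list of shortfalls. If vm.max_map_count is the only one, `sudo sysctl -w vm.max_map_count=262144` raises it until the next reboot.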
Verdict
Worth a look if you’re building production RAG where document quality is the bottleneck, or if you need agentic workflows with auditability. Skip it if you want a lightweight drop-in vector search layer: this is a full-stack appliance, not a library.