A RAG stack that actually stays on your laptop
LocalGPT wires Ollama, LanceDB, and a smart query router into a private document-chat system.

What it does
LocalGPT is a self-hosted document-QA stack. You upload files, it builds a search index, and you chat with the contents through a web UI or REST API. Everything runs locally via Ollama; no API keys, no data egress. The system is split into four services (Ollama, a RAG API, a backend server, and a React frontend) managed by a single Python launcher.
The interesting bit
The RAG pipeline is more opinionated than most. It mixes semantic search, BM25 keyword matching, and “Late Chunking” for long-context embeddings, then routes each query either to the RAG pipeline or straight to the LLM, using its own internal routing logic. There’s also a verification pass and sentence-level context pruning. Whether this complexity beats a simpler setup is left as an exercise for the user.
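To make the "semantic plus BM25" idea concrete, here is a minimal sketch of one common way to merge two ranked result lists: reciprocal rank fusion. This is an illustration only. Nothing above says which fusion method LocalGPT actually uses, and the function name, the chunk ids, and the constant k=60 are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. This is a generic technique, not LocalGPT's code.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a BM25 keyword search.
semantic_hits = ["chunk_a", "chunk_b", "chunk_c"]
keyword_hits = ["chunk_c", "chunk_a", "chunk_d"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Chunks that rank well in both lists float to the top, which is the general appeal of a hybrid setup: a keyword hit can rescue a passage the embedding model ranked poorly, and the other way round.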
Key highlights
- Pure-Python RAG core with LanceDB for vectors
- Supports CUDA, CPU, Intel Gaudi (HPU), and Apple Silicon (MPS)
- Pluggable models via Ollama; defaults to Qwen3 family for generation and embeddings
- Semantic caching with TTL to avoid repeated similar queries
- Session-aware chat history and source attribution on answers
- Docker and bare-metal install paths, plus a --no-frontend API-only mode
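The semantic caching highlight above is easier to picture with a small sketch. The version below assumes the caller already has an embedding vector for each query; the class name, the 0.95 similarity threshold, and the 300-second TTL are all hypothetical and are not taken from LocalGPT's code.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Tiny in-memory cache keyed by query embedding, with per-entry expiry."""

    def __init__(self, ttl_seconds=300.0, threshold=0.95, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.threshold = threshold
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries = []   # list of (embedding, answer, stored_at)

    def put(self, embedding, answer):
        self.entries.append((list(embedding), answer, self.clock()))

    def get(self, embedding):
        now = self.clock()
        # Drop entries older than the TTL before looking for a match.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller would run the full pipeline
```

A linear scan like this is fine for a handful of entries; a real system would more plausibly reuse its vector index for the lookup.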
Caveats
- Installation is currently only tested on macOS; Windows and Linux paths exist in docs but are untested
- “Multi-format support” currently means PDF only—DOCX, TXT, and Markdown are listed but not working yet
- The v2 branch is the one to clone; main is behind, which suggests the project is mid-transition
Verdict
Worth a look if you need a fully air-gapped RAG setup and don’t mind some assembly. Skip it if you want battle-tested cross-platform stability or production-grade document format support.