Self-hosted document pipeline that reads your PDFs and forgets your PII
An open-source API that turns documents into structured text or JSON using local OCR and LLMs, with a side of privacy scrubbing.

What it does
text-extract-api is a FastAPI service that ingests PDFs, Office files, and images, then spits out Markdown or structured JSON. It runs OCR through multiple strategies—EasyOCR, MiniCPM-V, Llama 3.2 Vision, or a remote Marker server—and can hand the results to Ollama models for cleanup, formatting, or stripping out personally identifiable information. Celery handles the queue, Redis caches intermediate OCR results, and everything ships via docker-compose.
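To make the caching step concrete, here is a minimal sketch of content-addressed OCR caching. It assumes the cache is keyed by a hash of the file bytes plus the strategy name; the source does not say how the project builds its keys. A plain dict stands in for Redis, and `cache_key`, `ocr_with_cache`, and `run_ocr` are hypothetical names, not the project's API.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical stand-in for Redis: a plain in-memory dict keyed by string.
_cache: Dict[str, str] = {}


def cache_key(file_bytes: bytes, strategy: str) -> str:
    """Derive a key from the file contents and the OCR strategy name."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{strategy}:{digest}"


def ocr_with_cache(
    file_bytes: bytes,
    strategy: str,
    run_ocr: Callable[[bytes], str],
) -> str:
    """Return cached OCR text if present; otherwise run OCR and store it."""
    key = cache_key(file_bytes, strategy)
    if key in _cache:
        return _cache[key]
    text = run_ocr(file_bytes)  # the expensive step we want to avoid repeating
    _cache[key] = text
    return text
```

Because the strategy name is part of the key, re-running the same file through a different engine triggers a fresh OCR pass instead of returning another engine's output.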
The interesting bit
The project treats OCR as a pluggable strategy rather than betting on one engine. More curiously, it uses an LLM as a post-processor to fix OCR errors—Llama corrects Llama’s own misreadings, which is either elegant recursion or a small conflict of interest. The PII removal runs through the same pipeline, so you can extract and sanitize in one pass without touching cloud APIs.
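The pluggable-strategy idea can be sketched in a few lines. This illustrates the pattern only and is not the project's code: the registry, the `register` decorator, and `extract` are hypothetical, and the "engine" registered under `easyocr` is a placeholder that just decodes bytes.

```python
from typing import Callable, Dict, Optional

# An OCR strategy maps raw file bytes to extracted text.
OcrStrategy = Callable[[bytes], str]

# Hypothetical registry. The key mirrors a strategy name from the article,
# but the function registered below is a placeholder, not a real engine.
STRATEGIES: Dict[str, OcrStrategy] = {}


def register(name: str) -> Callable[[OcrStrategy], OcrStrategy]:
    """Decorator that adds a strategy to the registry under `name`."""
    def decorator(fn: OcrStrategy) -> OcrStrategy:
        STRATEGIES[name] = fn
        return fn
    return decorator


@register("easyocr")
def fake_easyocr(data: bytes) -> str:
    # Placeholder: pretend the bytes are already text.
    return data.decode("utf-8", errors="replace")


def extract(
    data: bytes,
    strategy: str,
    post_process: Optional[Callable[[str], str]] = None,
) -> str:
    """Run the chosen strategy, then an optional post-processing step."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown OCR strategy: {strategy!r}")
    text = STRATEGIES[strategy](data)
    return post_process(text) if post_process else text
```

In the real pipeline the `post_process` slot would be filled by an LLM call; here any function from string to string will do, for example `extract(b"Inv0ice", "easyocr", lambda t: t.replace("0", "o"))`.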
Key highlights
- Ships fully local: PyTorch OCR + Ollama via docker-compose, no external data transfer
- Four OCR strategies: easyocr (fast, 30+ languages), minicpm-v, llama_vision (90B parameters, “probably the slowest”), or remote Marker for difficult scripts
- LLM post-processing for spelling correction and JSON structuring
- Built-in PII removal with example prompts for invoices, medical reports, etc.
- Redis caching for OCR results, Celery for distributed processing, pluggable storage (local, Google Drive)
- CLI tool and REST API for batch or interactive use
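For a feel of what LLM-based PII removal involves, the sketch below builds a request body in the shape accepted by Ollama's `/api/generate` endpoint (`model`, `prompt`, `stream`). The prompt wording and the model name `llama3.1` are assumptions made for illustration; the project ships its own example prompts, which are not reproduced here. Nothing is sent over the network.

```python
import json
from typing import Any, Dict

# Assumed prompt wording, for illustration only.
PII_PROMPT = (
    "Rewrite the document below, replacing names, addresses, phone numbers, "
    "and ID numbers with [REDACTED]. Return only the rewritten text.\n\n"
)


def build_pii_request(ocr_text: str, model: str = "llama3.1") -> Dict[str, Any]:
    """Build a non-streaming generate request that asks the model to scrub PII."""
    return {
        "model": model,                  # assumed model name; use whatever is pulled
        "prompt": PII_PROMPT + ocr_text,
        "stream": False,                 # ask for one JSON reply, not chunks
    }


# Serialize the request; POSTing it to a local Ollama instance is left out.
body = json.dumps(build_pii_request("Patient: Jane Doe, tel. 555-0100"))
```

Since the scrubbing is prompt-driven, its quality depends on the model and the prompt, so output for regulated data still deserves a manual spot check.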
Caveats
- Docker doesn’t support Apple GPUs; Mac users need a native install with manual dependency hunting (libmagic, poppler, ghostscript, etc.)
- The DISABLE_LOCAL_OLLAMA env var doesn’t work in Docker yet; disabling local Ollama requires editing the compose files directly
- Marker integration is deliberately excluded from the default distribution due to GPL3 licensing; you must run it as a separate service
- Llama 3.2 Vision is the default strategy and, at 90B parameters, also the slowest; plan accordingly
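For the native Mac route, a quick pre-flight check can save some of the dependency hunting. The sketch below assumes each dependency can be detected through a command-line tool it usually installs (`pdftoppm` for poppler, `gs` for ghostscript, `file` as a rough proxy for libmagic). That mapping is an assumption, not something the project documents, and it covers only the three dependencies named above.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed mapping from dependency to a command-line tool it usually installs.
# libmagic is a shared library, so the `file` command is only a rough proxy.
REQUIRED_TOOLS: Dict[str, str] = {
    "poppler": "pdftoppm",
    "ghostscript": "gs",
    "libmagic": "file",
}


def missing_dependencies(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of dependencies whose proxy tool is not on PATH."""
    return sorted(
        dep for dep, tool in REQUIRED_TOOLS.items() if which(tool) is None
    )


if __name__ == "__main__":
    missing = missing_dependencies()
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be swapped out in tests; by default it uses `shutil.which` against the current PATH.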
Verdict
Worth a look if you need document extraction in a regulated or privacy-sensitive environment where sending files to OpenAI or Google is a non-starter. Skip it if you want a one-click SaaS with zero infrastructure; the Docker-or-manual setup and Ollama model pulls are real work.