Is text-extract-api open source?

Yes — CatchTheTornado/text-extract-api is open source, released under the MIT license.

What language is text-extract-api written in?

CatchTheTornado/text-extract-api is primarily written in Python.

How popular is text-extract-api?

CatchTheTornado/text-extract-api has 3.1k stars on GitHub.

Where can I find text-extract-api?

CatchTheTornado/text-extract-api is on GitHub at https://github.com/CatchTheTornado/text-extract-api.

CatchTheTornado/text-extract-api

Self-hosted document pipeline that reads your PDFs and forgets your PII

An open-source API that turns documents into structured text or JSON using local OCR and LLMs, with a side of privacy scrubbing.

★ 3.1k stars · Python · Data Tooling

What it does

text-extract-api is a FastAPI service that ingests PDFs, Office files, and images, then spits out Markdown or structured JSON. It runs OCR through multiple strategies—EasyOCR, MiniCPM-V, Llama 3.2 Vision, or a remote Marker server—and can hand the results to Ollama models for cleanup, formatting, or stripping out personally identifiable information. Celery handles the queue, Redis caches intermediate OCR results, and everything ships via docker-compose.
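
To make the queued workflow concrete, here is a minimal sketch of the submit-then-poll pattern that a task queue like Celery enables. It is not the project's code: a thread pool stands in for the Celery workers, and `fake_ocr`, `submit`, `poll`, and `wait_for` are hypothetical names chosen for illustration.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

# Assumption: a thread pool stands in for the Celery worker pool.
_executor = ThreadPoolExecutor(max_workers=2)
_tasks: dict[str, Future] = {}


def fake_ocr(document: bytes) -> str:
    """Placeholder for a slow OCR job (hypothetical)."""
    time.sleep(0.05)
    return f"extracted {len(document)} bytes"


def submit(document: bytes) -> str:
    """Queue the job and return a task id immediately."""
    task_id = uuid.uuid4().hex
    _tasks[task_id] = _executor.submit(fake_ocr, document)
    return task_id


def poll(task_id: str) -> dict:
    """Report the task state, including the text once it is ready."""
    future = _tasks[task_id]
    if not future.done():
        return {"state": "PENDING"}
    return {"state": "SUCCESS", "result": future.result()}


def wait_for(task_id: str, interval: float = 0.01, timeout: float = 5.0) -> str:
    """Poll until the task finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = poll(task_id)
        if status["state"] == "SUCCESS":
            return status["result"]
        time.sleep(interval)
    raise TimeoutError(task_id)
```

The caller gets an id back right away and checks in later, which is why slow OCR runs do not block the HTTP request that started them.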

The interesting bit

The project treats OCR as a pluggable strategy rather than betting on one engine. More curiously, it uses an LLM as a post-processor to fix OCR errors. When the llama_vision strategy is paired with a Llama model for cleanup, Llama ends up correcting Llama’s own misreadings, which is either elegant recursion or a small conflict of interest. The PII removal runs through the same pipeline, so you can extract and sanitize in one pass without touching cloud APIs.
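
The pluggable-strategy idea can be sketched as a name-to-function registry followed by a chain of post-processing steps. This is an illustration under assumptions, not the project's implementation: `STRATEGIES`, `register`, `fake_fast`, `scrub_emails`, and `extract` are all hypothetical, the "OCR engine" merely decodes bytes, and a regex stands in for the LLM-driven PII step.

```python
import re
from typing import Callable

OcrStrategy = Callable[[bytes], str]
PostProcessor = Callable[[str], str]

# Hypothetical registry; real engines would be registered under their own names.
STRATEGIES: dict[str, OcrStrategy] = {}


def register(name: str) -> Callable[[OcrStrategy], OcrStrategy]:
    """Decorator that adds an OCR function to the registry."""
    def decorator(fn: OcrStrategy) -> OcrStrategy:
        STRATEGIES[name] = fn
        return fn
    return decorator


@register("fake_fast")
def fake_fast(document: bytes) -> str:
    # Stand-in for an OCR engine: decodes bytes instead of reading pixels.
    return document.decode("utf-8", errors="replace")


def scrub_emails(text: str) -> str:
    # Stand-in for the LLM-based PII step; a regex is far cruder than a model.
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[REDACTED]", text)


def extract(document: bytes, strategy: str, post: list[PostProcessor]) -> str:
    """Run the chosen OCR strategy, then each post-processor in order."""
    try:
        ocr = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None
    text = ocr(document)
    for step in post:
        text = step(text)
    return text
```

Because extraction and scrubbing share one call, swapping the engine or adding a cleanup step means changing an argument rather than the pipeline itself.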

Key highlights

Ships fully local: PyTorch OCR + Ollama via docker-compose, no external data transfer
Four OCR strategies: easyocr (fast, 30+ languages), minicpm-v, llama_vision (90B parameters, “probably the slowest”), or remote Marker for difficult scripts
LLM post-processing for spelling correction and JSON structuring
Built-in PII removal with example prompts for invoices, medical reports, etc.
Redis caching for OCR results, Celery for distributed processing, pluggable storage (local, Google Drive)
CLI tool and REST API for batch or interactive use
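
As a rough illustration of the OCR caching highlight, the sketch below keys cached text on a hash of the file contents plus the strategy name, so the same document is only processed once per engine. The key scheme is an assumption rather than the project's actual layout, a plain dict stands in for Redis, and `cache_key` and `cached_ocr` are hypothetical names.

```python
import hashlib
from typing import Callable

# Assumption: a plain dict stands in for the Redis store.
_cache: dict[str, str] = {}


def cache_key(document: bytes, strategy: str) -> str:
    """Build a key from the strategy name and a SHA-256 of the file bytes."""
    digest = hashlib.sha256(document).hexdigest()
    return f"ocr:{strategy}:{digest}"


def cached_ocr(document: bytes, strategy: str,
               run_ocr: Callable[[bytes], str]) -> str:
    """Return cached text if present; otherwise run OCR and store the result."""
    key = cache_key(document, strategy)
    if key not in _cache:
        _cache[key] = run_ocr(document)
    return _cache[key]
```

Including the strategy in the key keeps output from one engine from being served when another is requested for the same file.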

Caveats

Docker doesn’t support Apple GPUs; Mac users need a native install with manual dependency hunting (libmagic, poppler, ghostscript, etc.)
The DISABLE_LOCAL_OLLAMA env var doesn’t work in Docker yet—requires editing compose files directly
Marker integration is deliberately excluded from the default distribution due to GPL3 licensing; you must run it as a separate service
Llama 3.2 Vision is the default strategy and, at 90B parameters, also the slowest; plan accordingly
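
For the native-install route, a small preflight check can save some of the dependency hunting. The sketch below is an assumption-laden helper, not part of the project: `REQUIRED_TOOLS` and `check_native_tools` are hypothetical names, and the mapping relies on poppler normally installing `pdftoppm` and ghostscript normally installing `gs`. libmagic is a shared library rather than a command, so this check does not cover it.

```python
import shutil

# Assumed mapping from package name to one executable it normally installs.
REQUIRED_TOOLS = {
    "poppler": "pdftoppm",
    "ghostscript": "gs",
}


def check_native_tools(tools: dict[str, str] = REQUIRED_TOOLS) -> dict[str, bool]:
    """Return, per package, whether its executable is found on PATH."""
    return {pkg: shutil.which(exe) is not None for pkg, exe in tools.items()}


if __name__ == "__main__":
    for pkg, found in check_native_tools().items():
        print(f"{pkg}: {'ok' if found else 'missing'}")
```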

Verdict

Worth a look if you need document extraction in a regulated or privacy-sensitive environment where sending files to OpenAI or Google is a non-starter. Skip it if you want a one-click SaaS with zero infrastructure; the Docker-or-manual setup and Ollama model pulls are real work.
