Zipstack/unstract

Your PDFs have structure; you just need to ask nicely

Unstract turns document extraction into a prompt-and-deploy workflow instead of a regex archaeology dig.

Velocity (7d): +7.9 ★ / day · Trend: steady

What it does

Unstract is a self-hostable platform that feeds documents (PDFs, scans, spreadsheets, images) to LLMs and returns structured JSON. You describe the fields you want in natural language in its “Prompt Studio,” then expose the result as a REST API, an ETL pipeline, or an n8n node. The stack is familiar: React frontend, Django backend, Celery workers, PostgreSQL, Redis, and RabbitMQ, all wrapped in Docker Compose.
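To make the "expose the result as a REST API" step concrete, here is a minimal sketch of how a client might package a document for such an endpoint. Everything specific in it is an assumption for illustration, not Unstract's documented API: the URL, the bearer-token header, and the `files` form field are hypothetical, so check the deployment's own API page for the real values. The function only builds the request; it does not send it.

```python
import urllib.request
import uuid


def build_extraction_request(url, token, filename, payload):
    """Build (but do not send) a multipart POST carrying one document.

    The URL shape, bearer auth, and the "files" field name are
    hypothetical placeholders, not confirmed Unstract API details.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            'Content-Disposition: form-data; name="files"; '
            f'filename="{filename}"\r\n'
        ).encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req


# Hypothetical endpoint and token, for illustration only.
request = build_extraction_request(
    "http://localhost/api/extract/invoices/",
    "test-token",
    "invoice.pdf",
    b"%PDF-1.4 fake bytes",
)
```

Passing `request` to `urllib.request.urlopen` would send it; the response would then be the structured JSON described above, in whatever shape the prompt project defines.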

The interesting bit

The bet here is that prompt engineering replaces template engineering. Rather than maintaining brittle regexes per vendor or document type, you write a schema description once and let the LLM handle layout variations. The README’s “Current State vs. Unstract” table is unusually honest about this trade-off: it knows you’re currently suffering through “regex, build templates per vendor.”
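The difference between the two approaches can be sketched in a few lines. This is an illustration of the idea, not Unstract's code: the vendor names, patterns, schema fields, and helper functions are all made up, and the LLM call itself is left out, so `parse_llm_reply` just takes a JSON string as if a model had returned it.

```python
import json
import re

# Template approach: one hand-written pattern per vendor layout.
# (Vendors and patterns are invented for this example.)
VENDOR_PATTERNS = {
    "acme": re.compile(r"Invoice No\.\s*(\d+)"),
    "globex": re.compile(r"INV#(\d+)"),
}


def extract_with_regex(vendor, text):
    """Return the invoice number, or None for an unknown vendor or layout."""
    pattern = VENDOR_PATTERNS.get(vendor)
    if pattern is None:
        return None
    match = pattern.search(text)
    return match.group(1) if match else None


# Prompt approach: one schema description, reused for every layout.
INVOICE_SCHEMA = {
    "invoice_number": "The invoice identifier, digits only",
    "total": "The grand total as a decimal number",
}


def build_extraction_prompt(schema, document_text):
    """Turn the schema into a natural-language extraction request."""
    fields = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        "Extract the following fields from the document and reply with JSON only.\n"
        f"{fields}\n\nDocument:\n{document_text}"
    )


def parse_llm_reply(reply, schema):
    """Keep only the schema's fields; fields the model omitted become None."""
    data = json.loads(reply)
    return {name: data.get(name) for name in schema}
```

A new vendor breaks the first approach until someone writes another pattern, while the second only needs the same prompt sent again. The cost moves from maintaining templates to trusting and validating model output.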

Key highlights

  • Broad format support: PDF, DOCX, XLSX, PPTX, and common image formats
  • Pluggable LLM providers: OpenAI, Anthropic, Bedrock, Gemini, Ollama, Mistral, plus an “OpenAI Compatible” catch-all
  • Vector DB adapters: Qdrant, Pinecone, Weaviate, Milvus, PostgreSQL
  • ETL sources and destinations include S3, GCS, Azure Blob, Snowflake, BigQuery, Redshift, and major SQL databases
  • MCP server for agent integration (Claude, etc.) and an n8n custom node
  • One-script local deploy: ./run-platform.sh with default credentials unstract / unstract

Caveats

  • Requires 8 GB RAM minimum and Docker; not a lightweight sidecar
  • The encryption key warning is worth heeding: lose ENCRYPTION_KEY and your adapter credentials are gone
  • Enterprise features (dual-LLM verification, human-in-the-loop, SSO) are cloud-only; the open-source build is the extraction engine without the guardrails
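The first two caveats lend themselves to a quick pre-flight check before running the deploy script. The sketch below is an assumption-laden helper, not part of Unstract: `find_problems` and `detect_total_ram_bytes` are hypothetical names, and the env-file text is passed in by the caller because the file's location is not stated here.

```python
import os
import shutil

MIN_RAM_BYTES = 8 * 1024**3  # the 8 GB minimum from the caveats


def find_problems(total_ram_bytes, docker_path, env_text):
    """Return a list of human-readable problems; empty means ready to go."""
    problems = []
    # Note: a machine sold as "8 GB" can report slightly less usable memory,
    # so treat this warning as a hint rather than a hard stop.
    if total_ram_bytes < MIN_RAM_BYTES:
        problems.append("less than 8 GB RAM")
    if docker_path is None:
        problems.append("docker not found on PATH")
    # Look for a non-empty ENCRYPTION_KEY=... line in the env file text.
    has_key = any(
        line.strip().startswith("ENCRYPTION_KEY=")
        and line.strip() != "ENCRYPTION_KEY="
        for line in env_text.splitlines()
    )
    if not has_key:
        problems.append("ENCRYPTION_KEY not set")
    return problems


def detect_total_ram_bytes():
    """Total physical memory via POSIX sysconf (not available on Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

In practice you would call `find_problems(detect_total_ram_bytes(), shutil.which("docker"), env_text)` with the contents of whichever env file holds the key, and copy that key somewhere safe once it exists.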

Verdict

Worth a spin if you’re currently maintaining a graveyard of per-vendor document parsers and want to consolidate on LLM prompts. Skip it if your documents are already clean, your volumes are tiny, or you treat 8 GB RAM as extravagant.
