Unstructured-IO/unstructured

The messy document-to-LLM pipeline, bottled

A Python library that chews up PDFs, Word docs, and images, then spits out structured data your LLM can actually digest.

14.8k stars · HTML · Data Tooling · RAG · Search
Velocity (7d): +11 ★ / day · Trend: steady

What it does

unstructured is an open-source Python toolkit for ingesting and pre-processing documents—PDFs, HTML, Word files, images, and more—into clean, structured outputs. It’s pitched squarely at the LLM ecosystem: extract text, preserve layout, and feed the result into your RAG pipeline or embedding model without writing yet another ad-hoc parser.

The interesting bit

The library is built around modular “partition” functions (partition_pdf, partition_text, etc.) that you can call individually or through a single auto-detecting entry point (partition, in unstructured.partition.auto). The Docker setup is multi-arch (x86_64 and Apple Silicon), and the build system uses uv for dependency management: small conveniences that suggest the maintainers actually run this in production, not just demo it.
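A minimal sketch of what working with partition output can look like. It assumes the auto-detecting entry point is importable as partition from unstructured.partition.auto and that each returned element exposes .category and .text; the helper names load_elements, summarize, and to_plain_text are made up for illustration and are not part of the library.

```python
from collections import Counter
from typing import Iterable


def load_elements(path: str) -> list:
    """Partition a file using the library's auto-detecting entry point."""
    # Imported lazily so the helpers below work without the library installed.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def summarize(elements: Iterable) -> Counter:
    """Count elements per category, skipping elements whose text is blank."""
    return Counter(el.category for el in elements if el.text.strip())


def to_plain_text(elements: Iterable) -> str:
    """Join non-blank element text into one string, one paragraph per element."""
    return "\n\n".join(el.text.strip() for el in elements if el.text.strip())
```

With the library and the relevant extras installed, summarize(load_elements("report.pdf")) would give a quick view of how a document was split (the file name is a placeholder). Keeping the import inside load_elements means the two helpers can be tested with stand-in objects.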

Key highlights

  • Supports a wide document surface: PDFs, Office docs, HTML, XML, JSON, email, images
  • Optional extras for specific formats (pip install "unstructured[pdf,docx]") so you only install the Python dependencies for the formats you handle; system packages such as Tesseract and LibreOffice remain separate installs
  • Docker images tagged per commit and version, with latest for quick pulls
  • Uses uv for local development; one make install pulls all extras, dev, test, and lint dependencies
  • Enterprise “Platform” product sold separately with chunking, embedding, and table enrichment

Caveats

  • System dependencies stack up fast: poppler-utils, tesseract-ocr, libreoffice, libmagic-dev. This is not a pure-pip install for most real use cases
  • The README warns that local Docker builds can break when the wolfi-base image updates upstream
  • Heavy overlap with LangChain document loaders, with no clear guidance on when to pick one over the other
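Because the system dependencies above are not installed by pip, a small preflight check can surface missing ones before a pipeline fails mid-run. This is a sketch using only the standard library. The executable names (pdftotext, tesseract, soffice) are assumptions about what each package typically puts on PATH, and check_system_deps and missing_deps are hypothetical helper names.

```python
import shutil
from ctypes.util import find_library

# Package name -> executable it is assumed to install on PATH.
REQUIRED_BINARIES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}


def check_system_deps() -> dict[str, bool]:
    """Report which system dependencies appear to be available."""
    status = {
        pkg: shutil.which(exe) is not None
        for pkg, exe in REQUIRED_BINARIES.items()
    }
    # libmagic is a shared library rather than an executable, so look it up by name.
    status["libmagic"] = find_library("magic") is not None
    return status


def missing_deps(status: dict[str, bool]) -> list[str]:
    """Return the names of dependencies that were not found, sorted."""
    return sorted(pkg for pkg, ok in status.items() if not ok)
```

Calling missing_deps(check_system_deps()) at startup gives a list you can log or fail on. It only checks that something with the expected name exists, not that the version is compatible.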

Verdict

Worth a look if you’re building a document ingestion pipeline and tired of gluing together pdfplumber, python-docx, and BeautifulSoup. Skip it if your inputs are already clean text or if you’re happy with your current parser stack.
