Unstructured-IO/unstructured

Pre-processing documents so LLMs don't choke

The library exists to transform messy documents—PDFs, Word files, HTML, images—into clean, structured data that language models can actually ingest.

★ 15.2k stars · Language (GitHub): HTML · Data Tooling · RAG · Search


Velocity (7d): +6.9 ★/day · Trend: steady

What it does

unstructured is an open-source Python library that ingests and pre-processes documents—PDFs, HTML, Word files, images, and others—and converts them into structured outputs. It acts as an ETL layer between raw document chaos and LLM pipelines, using modular connectors and format-specific partitioners to extract usable text and structure. The goal is to spare you from writing one-off parsers for every file type your RAG pipeline encounters.

The interesting bit

Instead of treating document parsing as a single monolithic extraction, the library breaks the process into modular functions like partition_pdf and partition_text, plus an auto-detection router that picks the right parser for the job. It is largely plumbing: it orchestrates external tools such as Tesseract, Poppler, and LibreOffice behind a unified Python interface, so you call one API instead of scripting each tool yourself. Those tools still have to be installed on the system.
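To make the router idea concrete, here is a minimal standard-library sketch of the pattern: a table of format-specific partitioners and one entry point that picks among them. Every name in it (partition_sketch, partition_text_sketch, partition_html_sketch, PARTITIONERS, the element dict shape) is hypothetical and chosen for illustration. It is not the library's implementation, and it assumes detection by file extension only, which is a simplification.

```python
import mimetypes
from html.parser import HTMLParser
from typing import Callable

# Hypothetical element shape for this sketch: a type label plus the text.
Element = dict[str, str]


def partition_text_sketch(raw: bytes) -> list[Element]:
    """Split plain text into paragraph elements on blank lines."""
    blocks = [block.strip() for block in raw.decode("utf-8").split("\n\n")]
    return [{"type": "Text", "text": block} for block in blocks if block]


class _TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def partition_html_sketch(raw: bytes) -> list[Element]:
    """Turn each text node of an HTML document into one element."""
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8"))
    collector.close()
    return [{"type": "Text", "text": chunk} for chunk in collector.chunks]


# Routing table: MIME type -> format-specific partitioner.
PARTITIONERS: dict[str, Callable[[bytes], list[Element]]] = {
    "text/plain": partition_text_sketch,
    "text/html": partition_html_sketch,
}


def partition_sketch(filename: str, raw: bytes) -> list[Element]:
    """Guess the format from the file name and dispatch to its partitioner."""
    mime_type, _ = mimetypes.guess_type(filename)
    handler = PARTITIONERS.get(mime_type or "")
    if handler is None:
        raise ValueError(f"no partitioner registered for {filename!r} ({mime_type})")
    return handler(raw)
```

Callers only ever touch partition_sketch; supporting a new format means adding one function and one table entry. With the real library installed, the corresponding entry point is partition from unstructured.partition.auto, called as partition(filename=...).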

Key highlights

Supports many document types, from PDF and DOCX to HTML and email, via optional extras so you install only what you need.
Provides both specific partitioners (partition_pdf, partition_text) and an auto-detection entry point (partition) that infers the correct parser from the file.
Ships multi-platform Docker images for x86_64 and Apple silicon with heavy dependencies pre-installed.
Uses uv for dependency management and offers granular extras, meaning you can avoid dragging in OCR libraries if you only need plain-text parsing.

Caveats

System dependencies such as Tesseract, Poppler, and LibreOffice are still required for many formats; the library wraps them but does not eliminate them.
The README warns that local Docker builds can fail due to upstream changes in the wolfi-base image.
The open-source library is positioned as a stepping stone to the commercial Unstructured Platform, which offers better processing performance, chunking, and embedding—so production-grade features may push you toward the paid product.
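Because the system tools are wrapped rather than bundled, a quick preflight check before a large ingestion run can save a confusing failure later. The sketch below uses only the standard library; the function names are hypothetical, and the executable names (tesseract, pdftoppm for Poppler, soffice for LibreOffice) are assumptions based on how these tools are commonly packaged, so adjust them for your platform.

```python
import shutil
from typing import Callable, Optional

# Assumed executable names; your distribution may package them differently.
SYSTEM_TOOLS = {
    "Tesseract": "tesseract",
    "Poppler": "pdftoppm",
    "LibreOffice": "soffice",
}


def check_system_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict[str, bool]:
    """Report, per tool, whether its executable is found on PATH."""
    return {name: which(exe) is not None for name, exe in SYSTEM_TOOLS.items()}


def missing_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """List the tools whose executables could not be found."""
    return [name for name, found in check_system_tools(which).items() if not found]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All checked system tools were found on PATH.")
```

The which parameter exists only so the lookup can be replaced in tests. Finding an executable on PATH does not prove the installed version is one the library supports.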

Verdict

Worth a look if you are building RAG or document ingestion pipelines and need a unified interface to extract text from heterogeneous file types. Skip it if you only process plain text or already have a mature in-house document parsing stack.

Frequently asked

What is Unstructured-IO/unstructured? The library exists to transform messy documents (PDFs, Word files, HTML, images) into clean, structured data that language models can actually ingest.
Is unstructured open source? Yes. Unstructured-IO/unstructured is open source, released under the Apache-2.0 license.
What language is unstructured written in? unstructured is a Python library, although GitHub's language statistics list HTML as the repository's top language.
How popular is unstructured? Unstructured-IO/unstructured has 15.2k stars on GitHub, and its star growth is currently steady.
Where can I find unstructured? Unstructured-IO/unstructured is on GitHub at https://github.com/Unstructured-IO/unstructured.