yobix-ai/extractous
A high-performance Rust library for extracting text, metadata, and structure from unstructured documents like PDFs and Word files.

Extractous is a document content extraction library written in Rust that processes PDFs, Word documents, HTML, and other formats. It provides bindings for Python, Node.js, Go, and other languages. The project positions itself as infrastructure for retrieval-augmented generation (RAG) and LLM workflows, and claims to be 25x faster than unstructured-io, a library commonly used in AI document-processing pipelines. It includes OCR capabilities and is designed to feed extracted content into machine learning and NLP systems.
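
To make the "feed extracted content into RAG and LLM workflows" step concrete, here is a minimal sketch of what typically happens to text after extraction: it is split into overlapping chunks before embedding or indexing. This is not part of Extractous and calls none of its API. The function name `chunk_text`, its parameters `max_chars` and `overlap`, and the sample string are all hypothetical, and the input is assumed to be plain text that an extraction step has already produced.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 20) -> list[str]:
    """Split already-extracted text into overlapping character windows.

    Hypothetical helper for illustration only; it is not an Extractous API.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    # Placeholder standing in for text returned by a document extractor.
    extracted_text = "Quarterly report. " * 40
    for i, chunk in enumerate(chunk_text(extracted_text, max_chars=200, overlap=20)):
        print(i, len(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text. Real pipelines often split on sentence or token boundaries instead of raw characters.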