enoch3712/ExtractThinker
A Python library for extracting and classifying structured data from PDFs, images, and documents using LLMs with ORM-style document workflow abstractions.

ExtractThinker is a document intelligence library that uses LLMs to extract and classify structured data from PDFs, images, and other document formats. It provides document loaders backed by OCR engines such as Tesseract and by cloud services such as AWS Textract and Google Document AI, and it integrates with multiple LLM providers, including OpenAI and Anthropic. Developers define extraction contracts as Pydantic models and can process large documents asynchronously, choosing a splitting strategy that breaks each document into smaller parts before extraction.
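
To make the idea of asynchronous processing with interchangeable splitting strategies concrete, here is a minimal sketch using only the Python standard library. It does not use ExtractThinker's API: `split_fixed`, `split_on_marker`, `extract_chunk`, `process`, and `run_demo` are hypothetical names invented for illustration, and `extract_chunk` is a stub standing in for a real LLM call that would return a validated Pydantic contract.

```python
import asyncio
from typing import Callable

Pages = list[str]


def split_fixed(pages: Pages, size: int = 2) -> list[Pages]:
    """Hypothetical strategy: group pages into fixed-size chunks."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def split_on_marker(pages: Pages, marker: str = "INVOICE") -> list[Pages]:
    """Hypothetical strategy: start a new chunk at each page beginning with `marker`."""
    chunks: list[Pages] = []
    for page in pages:
        if page.startswith(marker) or not chunks:
            chunks.append([])
        chunks[-1].append(page)
    return chunks


async def extract_chunk(chunk: Pages) -> dict:
    """Stub for an LLM extraction call (assumption: the real call is network-bound)."""
    await asyncio.sleep(0)  # yield control, as an awaited request would
    return {"pages": len(chunk), "first_line": chunk[0]}


async def process(
    pages: Pages,
    splitter: Callable[[Pages], list[Pages]],
    max_concurrency: int = 4,
) -> list[dict]:
    """Split the document, then extract from all chunks concurrently."""
    semaphore = asyncio.Semaphore(max_concurrency)  # cap simultaneous calls

    async def guarded(chunk: Pages) -> dict:
        async with semaphore:
            return await extract_chunk(chunk)

    # gather returns results in the same order as the chunks
    return await asyncio.gather(*(guarded(c) for c in splitter(pages)))


def run_demo() -> tuple[list[dict], list[dict]]:
    pages = ["INVOICE 1", "line items", "INVOICE 2", "line items", "totals"]
    fixed = asyncio.run(process(pages, split_fixed))
    by_marker = asyncio.run(process(pages, split_on_marker))
    return fixed, by_marker
```

Because the splitter is passed in as a plain function, the concurrent extraction step stays the same whichever strategy is chosen. The semaphore is one way to keep the number of in-flight requests within a provider's rate limits.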