Google's LLM-powered highlighter for messy documents
LangExtract turns wall-of-text documents into structured, verifiable data by making the LLM show its work.

What it does
LangExtract is a Python library that uses LLMs to pull structured information out of unstructured text (clinical notes, novels, reports) based on instructions and a few examples you provide. It chunks long documents, processes the chunks in parallel, can run multiple extraction passes for better recall, and maps every extracted entity back to its exact character position in the source text.
The interesting bit
The library doesn’t just return JSON and hope you trust it. It forces grounding: extractions that can’t be located in the source text get char_interval = None, so you can filter them out. It also generates a self-contained interactive HTML visualization that highlights extractions in their original context—useful when you’re reviewing hundreds of entities from a full novel.
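To make the grounding idea concrete, here is a minimal stdlib-only sketch. The `CharInterval` and `Extraction` classes are stand-ins written for this example (their field names are assumptions modeled on the description above), and the exact-match `str.find` lookup is a simplification, not the library's actual alignment logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CharInterval:
    # Hypothetical stand-in: start/end offsets into the source text.
    start_pos: int
    end_pos: int


@dataclass
class Extraction:
    # Hypothetical stand-in for an extracted entity.
    extraction_class: str
    extraction_text: str
    char_interval: Optional[CharInterval] = None


def ground(source: str, extraction_class: str, text: str) -> Extraction:
    """Locate `text` in `source`; char_interval stays None if it is absent."""
    start = source.find(text)
    interval = CharInterval(start, start + len(text)) if start != -1 else None
    return Extraction(extraction_class, text, interval)


def grounded_only(extractions: List[Extraction]) -> List[Extraction]:
    """Keep only extractions that could be located in the source text."""
    return [e for e in extractions if e.char_interval is not None]


source = "Patient was given 250 mg amoxicillin twice daily."
candidates = [
    ground(source, "dosage", "250 mg"),
    ground(source, "medication", "amoxicillin"),
    ground(source, "medication", "ibuprofen"),  # not present in the source
]
kept = grounded_only(candidates)
for e in kept:
    span = source[e.char_interval.start_pos:e.char_interval.end_pos]
    print(e.extraction_class, repr(span), e.char_interval)
```

The point of the filter is that anything the model made up (here, "ibuprofen") has no interval to point at, so it drops out before review.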
Key highlights
- Source-grounded extractions with exact character intervals for traceability
- Built-in chunking, parallel processing (max_workers), and multiple extraction passes for long documents
- Interactive HTML visualization from JSONL output, no external dependencies
- Supports Gemini (default: gemini-3.5-flash), OpenAI, and local models via Ollama
- Optional Vertex AI Batch API for cost-sensitive large-scale processing
- Controlled generation with schema-constrained outputs on supported models
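The chunking, parallel workers, and multiple passes in the list above fit together roughly like the sketch below. This is an illustration under stated assumptions, not LangExtract's implementation: `chunk_text`, `toy_extract`, and `extract_document` are hypothetical helpers, the "model" is a regex that deliberately misses every other hit to mimic imperfect recall, and merging passes by taking the union of spans is a simplification.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into (offset, chunk) pairs, breaking at whitespace when possible."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left-hand chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def toy_extract(chunk, pass_id):
    """Stand-in for an LLM call (assumption): finds capitalized words but
    skips every other hit, alternating by pass, to mimic imperfect recall."""
    hits = [(m.start(), m.end()) for m in re.finditer(r"\b[A-Z][a-z]+\b", chunk)]
    return [span for i, span in enumerate(hits) if i % 2 == pass_id % 2]


def extract_document(text, max_chars=40, passes=2, max_workers=4):
    """Run several passes over all chunks in parallel and merge the spans."""
    chunks = chunk_text(text, max_chars)
    bodies = [body for _, body in chunks]
    found = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for pass_id in range(passes):
            results = pool.map(toy_extract, bodies, [pass_id] * len(bodies))
            for (offset, _), spans in zip(chunks, results):
                # Shift chunk-relative spans back to document coordinates.
                found.update((offset + s, offset + e) for s, e in spans)
    return sorted(found)


text = "Romeo met Juliet in Verona while Mercutio and Tybalt argued near Mantua."
one_pass = extract_document(text, passes=1)
two_pass = extract_document(text, passes=2)
print([text[s:e] for s, e in one_pass])
print([text[s:e] for s, e in two_pass])
```

The offset bookkeeping is the part worth noticing: each chunk remembers where it started, so spans found inside a chunk can still be reported as positions in the full document.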
Caveats
- Cloud models require an API key; the README nudges you toward paid Gemini tiers for production throughput
- Gemini models have defined retirement dates, so you’ll need to track model lifecycle documentation
- The library warns about “prompt alignment” if your few-shot examples aren’t verbatim and in order—examples drive behavior, and sloppy examples mean sloppy extractions
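If you want to catch the "verbatim and in order" problem before spending API calls, a rough pre-flight check might look like this. `check_alignment` is a hypothetical helper written for this example, not part of LangExtract, and it only approximates whatever checks the library runs internally.

```python
def check_alignment(example_text, extraction_texts):
    """Return a list of problems: extraction strings that do not appear
    verbatim in the example text, or that appear out of order."""
    problems = []
    cursor = 0  # search resumes after the previous match to enforce ordering
    for item in extraction_texts:
        pos = example_text.find(item, cursor)
        if pos != -1:
            cursor = pos + len(item)
        elif item in example_text:
            problems.append(f"out of order: {item!r}")
        else:
            problems.append(f"not verbatim: {item!r}")
    return problems


example_text = "ROMEO. But soft! What light through yonder window breaks?"
good = ["ROMEO", "But soft!", "yonder window"]
bad = ["But soft!", "ROMEO", "a window that breaks"]

print(check_alignment(example_text, good))
print(check_alignment(example_text, bad))
```

An empty list means every example extraction is an exact substring of its example text and the extractions appear in reading order.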
Verdict
Worth a look if you need to extract entities from long documents and actually verify where they came from. Skip it if you just need quick unstructured summarization or can’t stomach API costs and model churn.