google/langextract

Google's LLM-powered highlighter for messy documents

LangExtract turns wall-of-text documents into structured, verifiable data by making the LLM show its work.

36.8k stars · Python · Data Tooling · Language Models
Velocity (7d): +110 ★/day · Trend: steady

What it does

LangExtract is a Python library that uses LLMs to pull structured information out of unstructured text (clinical notes, novels, reports) based on instructions and a few examples you provide. It chunks long documents, runs parallel passes for better recall, and maps every extracted entity back to its exact character position in the source text.

The interesting bit

The library doesn’t just return JSON and hope you trust it. It forces grounding: extractions that can’t be located in the source text get char_interval = None, so you can filter them out. It also generates a self-contained interactive HTML visualization that highlights extractions in their original context, which is useful when you’re reviewing hundreds of entities from a full novel.
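
To make the grounding idea concrete, here is a minimal standard-library sketch. It is not LangExtract's alignment code: the `GroundedExtraction` class and the `ground` and `keep_grounded` helpers are hypothetical names, and exact substring search stands in for whatever matching the library really performs. Only the `char_interval = None` convention comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroundedExtraction:
    """Hypothetical record; only the char_interval convention mirrors the text above."""
    extraction_class: str
    extraction_text: str
    char_interval: Optional[Tuple[int, int]] = None  # (start, end) or None if ungrounded


def ground(source: str, candidates: List[Tuple[str, str]]) -> List[GroundedExtraction]:
    """Attach a character interval to each (class, text) pair via exact substring search."""
    results = []
    for extraction_class, text in candidates:
        start = source.find(text)
        # Assumption: an extraction that cannot be located gets char_interval = None.
        interval = (start, start + len(text)) if start != -1 else None
        results.append(GroundedExtraction(extraction_class, text, interval))
    return results


def keep_grounded(extractions: List[GroundedExtraction]) -> List[GroundedExtraction]:
    """Drop every extraction that could not be traced back to the source."""
    return [e for e in extractions if e.char_interval is not None]


if __name__ == "__main__":
    source = "Patient was given 250 mg of amoxicillin twice daily."
    model_output = [
        ("dosage", "250 mg"),
        ("medication", "amoxicillin"),
        ("medication", "ibuprofen"),  # not in the source, so it stays ungrounded
    ]
    for e in keep_grounded(ground(source, model_output)):
        start, end = e.char_interval
        print(e.extraction_class, repr(source[start:end]), (start, end))
```

The filter is the point: anything the model produced that cannot be found in the source never reaches the reviewer.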

Key highlights

  • Source-grounded extractions with exact character intervals for traceability
  • Built-in chunking, parallel processing (max_workers), and multiple extraction passes for long documents
  • Interactive HTML visualization from JSONL output, no external dependencies
  • Supports Gemini (default: gemini-3.5-flash), OpenAI, and local models via Ollama
  • Optional Vertex AI Batch API for cost-sensitive large-scale processing
  • Controlled generation with schema-constrained outputs on supported models
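
The chunking, parallel-worker, and multi-pass bullets above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: `chunk_text`, `extract_long`, `chunk_size`, and `passes` are hypothetical names (only `max_workers` appears in the summary), and `find_capitalized` is a regex stand-in for the LLM call.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, text) in whole-document coordinates


def chunk_text(text: str, chunk_size: int) -> List[Tuple[int, str]]:
    """Split text into fixed-size chunks and remember each chunk's offset.

    A naive split can cut an entity in half at a chunk boundary; a real
    chunker would break on sentence or token boundaries instead.
    """
    return [(i, text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]


def extract_long(
    text: str,
    extract_fn: Callable[[str], List[Tuple[int, int]]],
    chunk_size: int = 1000,
    passes: int = 2,
    max_workers: int = 4,
) -> List[Span]:
    """Run extract_fn over every chunk in parallel, repeat, and merge the results."""
    chunks = chunk_text(text, chunk_size)
    found: Set[Span] = set()  # the set de-duplicates spans found in several passes
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(passes):
            per_chunk = pool.map(extract_fn, [chunk for _, chunk in chunks])
            for (offset, _), local_spans in zip(chunks, per_chunk):
                for start, end in local_spans:
                    # Shift chunk-local positions back into document positions.
                    doc_start, doc_end = offset + start, offset + end
                    found.add((doc_start, doc_end, text[doc_start:doc_end]))
    return sorted(found)


def find_capitalized(chunk: str) -> List[Tuple[int, int]]:
    """Stand-in for a model call: 'extracts' capitalized words from one chunk."""
    return [m.span() for m in re.finditer(r"\b[A-Z][a-z]+\b", chunk)]


if __name__ == "__main__":
    document = "Romeo loves Juliet. " * 5
    for span in extract_long(document, find_capitalized, chunk_size=20):
        print(span)
```

With a deterministic stand-in, extra passes find nothing new. With a real model, each pass can surface different entities, which is why merging the union helps recall.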

Caveats

  • Cloud models require an API key; the README nudges you toward paid Gemini tiers for production throughput
  • Gemini models have defined retirement dates, so you’ll need to track model lifecycle documentation
  • The library warns about “prompt alignment” if your few-shot examples aren’t verbatim and in order—examples drive behavior, and sloppy examples mean sloppy extractions
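
One way to picture the “verbatim and in order” requirement is a small pre-flight check on your own few-shot examples. This sketch is an assumption about what such a check could look like, not the library's validator; `alignment_issues` is a hypothetical helper.

```python
from typing import List


def alignment_issues(example_text: str, extraction_texts: List[str]) -> List[str]:
    """Report extraction strings that are not verbatim, or not in order of appearance."""
    issues = []
    cursor = 0  # position just past the previous match
    for text in extraction_texts:
        pos = example_text.find(text, cursor)
        if pos != -1:
            cursor = pos + len(text)
        elif text in example_text:
            # Present in the example, but only before the previous extraction.
            issues.append(f"out of order: {text!r}")
        else:
            # Paraphrased or invented, so it cannot be matched to the example text.
            issues.append(f"not verbatim: {text!r}")
    return issues


if __name__ == "__main__":
    example = "Lady Juliet gazed longingly at the stars"
    print(alignment_issues(example, ["Lady Juliet", "gazed longingly"]))
    print(alignment_issues(example, ["gazed longingly", "Lady Juliet"]))
    print(alignment_issues(example, ["Juliet looked"]))
```

Running a check like this before calling the model catches sloppy examples early, when they are cheap to fix.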

Verdict

Worth a look if you need to extract entities from long documents and actually verify where they came from. Skip it if you just need quick unstructured summarization or can’t stomach API costs and model churn.
