ucbepic/docetl

ETL pipelines that actually talk back

DocETL turns LLMs into pipeline operators you can prototype in a browser and ship to production.

Velocity (7d): +5.4 ★/day · Trend: steady

What it does

DocETL is a Python framework for building document-processing pipelines where the transforms are LLM prompts rather than SQL or Python functions. It ships with DocWrangler, a browser-based playground for iterating on prompts step-by-step, then exporting the final pipeline config to run headless in production.
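To make "transforms are LLM prompts" concrete, here is a minimal offline sketch of a prompt-based map step. It is not DocETL's API: `prompt_map` and `fake_llm` are hypothetical names invented for illustration, and the stub stands in for a real provider call so the example runs without a key.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: upper-cases the last prompt line
    # so the example is deterministic and needs no API key.
    return prompt.splitlines()[-1].upper()


def prompt_map(
    docs: list[dict],
    template: str,
    llm: Callable[[str], str],
    out_key: str,
) -> list[dict]:
    """Apply one prompt-based transform to every document (illustrative only)."""
    results = []
    for doc in docs:
        # Fill the prompt template with the document's fields.
        prompt = template.format(**doc)
        # Keep the original fields and attach the model's answer.
        results.append({**doc, out_key: llm(prompt)})
    return results


docs = [{"text": "invoice from acme"}, {"text": "memo about budget"}]
out = prompt_map(docs, "Summarize the document below.\n{text}", fake_llm, "summary")
```

Swapping `fake_llm` for a real client is where the per-token cost comes in: every document in `docs` triggers one model call.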

The interesting bit

The split personality is the point: the same YAML-ish pipeline definition runs in a clicky UI for debugging and from the CLI for batch jobs. The README even suggests using Claude Code to write the pipeline that will later feed Claude via API: recursive automation with a straight face.

Key highlights

  • DocWrangler playground: hosted at docetl.org/playground, or run locally via Docker (make docker)
  • Production runner: pip install docetl, load the exported config, execute
  • Multi-provider: OpenAI out of the box, AWS Bedrock via LiteLLM's provider-prefixed model names (a bedrock/ prefix on the model ID)
  • Two .env files: root .env for the Python backend, website/.env.local for the TypeScript frontend—easy to mix up, so the README warns you twice
  • Paper-backed: arXiv preprint linked for the academically curious
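The provider-prefix convention mentioned in the highlights can be sketched as a simple string split. This is an illustration of the naming scheme only, not LiteLLM's actual routing code; `split_provider` and the model ID `some-model-id` are hypothetical, and the fallback to `openai` for unprefixed names is an assumption made for the example.

```python
def split_provider(model: str, default: str = "openai") -> tuple[str, str]:
    """Split a 'provider/model' string; unprefixed names fall back to `default`."""
    provider, sep, name = model.partition("/")
    if not sep:
        # No slash found: treat the whole string as the model name.
        return default, model
    return provider, name


openai_route = split_provider("gpt-4o-mini")
bedrock_route = split_provider("bedrock/some-model-id")  # hypothetical model ID
```

In practice the switch between providers is a one-line change to the model string in the pipeline config, plus the matching credentials in the backend's .env file.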

Caveats

  • Requires Python 3.10+ and an OpenAI key just to get started; BYO API budget
  • Local setup is Makefile-heavy; the “manual” path still expects uv, pre-commit, and Node dependencies
  • The hosted playground is convenient, but any non-trivial run sends your documents to a third-party LLM

Verdict

Grab this if you’re drowning in unstructured documents and want to replace brittle regexes with prompt-based transforms you can iterate on visually. Skip it if your data is already clean, tabular, or you flinch at per-token pricing on every pipeline run.
