A PDF parser that won't phone home or eat your RAM
Run-LLama's Rust-core tool extracts text, bounding boxes, and screenshots locally, with an escape hatch to cloud OCR when documents get nasty.

What it does
LiteParse is a Rust-based document parser that chews through PDFs, Office files, and images to emit plain text, structured JSON with bounding boxes, or page screenshots. It bundles Tesseract for zero-config OCR, can swap in external OCR servers via a dead-simple HTTP API, and wraps everything in identical CLIs for Rust, Python, Node, and even WASM in the browser.
The interesting bit
The architecture is deliberately unsexy in a good way: PDFium handles extraction, LibreOffice/ImageMagick convert odd formats, then a “grid projection” stage reconstructs spatial layout from merged native and OCR text. The README is unusually honest about its limits: complex tables, charts, and handwriting are explicitly punted to the vendor’s cloud LlamaParse service.
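To make the “grid projection” idea concrete, here is a minimal sketch of one way such a stage could work: snap each text fragment's position to a character grid so that columns stay aligned in plain-text output. This is not LiteParse's code. The `TextItem` type, the `project_to_grid` function, and the fixed `char_width` and `line_height` values are all assumptions made for illustration; a real implementation would presumably derive cell sizes from font metrics and handle overlapping native and OCR fragments.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical text fragment with its top-left corner in page points."""
    text: str
    x: float  # left edge, origin at the page's left
    y: float  # top edge, origin at the page's top


def project_to_grid(items, char_width=6.0, line_height=12.0):
    """Place each item on a character grid (assumed fixed cell size)."""
    rows = {}
    for item in items:
        row = round(item.y / line_height)  # which text line the item lands on
        col = round(item.x / char_width)   # which character column it starts at
        rows.setdefault(row, []).append((col, item.text))
    if not rows:
        return ""

    lines = []
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # Previous item overran this column: keep a single space.
                line += " "
            else:
                # Pad with spaces up to the target column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    page = [
        TextItem("Name", 0, 0), TextItem("Qty", 60, 0),
        TextItem("Widget", 0, 12), TextItem("3", 60, 12),
    ]
    print(project_to_grid(page))
```

With the sample items, both rows start their second column at character 10, so "Qty" and "3" line up even though "Name" and "Widget" differ in length. That alignment is what lets downstream consumers, such as an LLM, read simple tabular layouts from plain text.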
Key highlights
- Ships Tesseract built-in; no model downloads or API keys for basic OCR
- OCR backend is pluggable: drop in EasyOCR, PaddleOCR, or anything speaking its minimal HTTP spec
- Same `lit` CLI across npm, pip, and cargo; parses remote PDFs from stdin too
- Auto-converts DOCX, XLSX, PPTX, and images via system LibreOffice/ImageMagick
- Page screenshot generation at configurable DPI, aimed at LLM agent workflows
Caveats
- Office and image conversion requires external system dependencies (LibreOffice, ImageMagick) that won’t be present in all environments
- The README itself warns that on “dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs” it will underperform versus cloud alternatives
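Since Office and image conversion lean on system tools, it can be worth checking for them before wiring the parser into a pipeline. The following is a rough sketch, independent of LiteParse itself: the `find_converters` helper is hypothetical, and the binary names it probes (`soffice`/`libreoffice` and `magick`/`convert`) are the usual ones for these tools, not something the README is known to specify.

```python
import shutil

# Assumed binary names; adjust for your platform or package manager.
# Note: on Windows, "convert" is an unrelated system utility, so prefer "magick".
CANDIDATES = {
    "LibreOffice": ("soffice", "libreoffice"),
    "ImageMagick": ("magick", "convert"),
}


def find_converters(candidates=CANDIDATES):
    """Return {tool name: path or None} using the first binary found on PATH."""
    found = {}
    for tool, binaries in candidates.items():
        paths = (shutil.which(name) for name in binaries)
        found[tool] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    for tool, path in find_converters().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

Running this in a slim container image is a quick way to find out whether DOCX or image inputs are likely to fail before any documents are processed.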
Verdict
Grab this if you need fast, private, local document ingestion for RAG pipelines or agent tools, and your documents are mostly standard text. Skip it if you’re processing gnarly scanned archives and don’t want to wire up a custom OCR server or pay for the cloud upgrade.