run-llama/liteparse

A document parser that keeps your files local—until it can't

LiteParse gives developers a fast, offline way to extract structured text and bounding boxes from PDFs and Office files, bundling Tesseract OCR and admitting upfront that some documents are simply too messy for local tools.

★ 11.7k stars · Rust · Data Tooling

Velocity (7d): +46 ★ / day
Trend: ↗ accelerating

What it does

LiteParse is a Rust-core document parser that turns PDFs, Office documents, and images into structured text or JSON with precise bounding boxes. It bundles Tesseract for zero-setup OCR, can proxy to external HTTP OCR services like EasyOCR or PaddleOCR, and renders page screenshots for LLM agents. Everything runs locally across Linux, macOS, Windows, and even the browser via WASM, with language bindings for Python, Node.js, and TypeScript.
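
To make "structured JSON with bounding boxes" more concrete, here is a minimal Python sketch of filtering extracted text by page region. The JSON shape (pages, items, text, x, y, width, height) and the helper items_in_region are assumptions made for illustration, not LiteParse's documented schema; check the project's README for the real field names.

```python
import json

# Hypothetical parser output: the field names below are assumed for
# illustration and are not taken from LiteParse's documentation.
SAMPLE = """
{
  "pages": [
    {
      "page": 1,
      "items": [
        {"text": "Invoice #1001", "x": 72.0, "y": 60.0, "width": 120.0, "height": 14.0},
        {"text": "Total: $250.00", "x": 400.0, "y": 700.0, "width": 110.0, "height": 12.0}
      ]
    }
  ]
}
"""


def items_in_region(page, left, top, right, bottom):
    """Return the text of items whose boxes lie fully inside the region."""
    hits = []
    for item in page["items"]:
        x0, y0 = item["x"], item["y"]
        x1, y1 = x0 + item["width"], y0 + item["height"]
        if x0 >= left and y0 >= top and x1 <= right and y1 <= bottom:
            hits.append(item["text"])
    return hits


doc = json.loads(SAMPLE)
first_page = doc["pages"][0]
# Keep only text in the top 100 units of the page (a header band).
header_text = items_in_region(first_page, 0, 0, 612, 100)
print(header_text)
```

Spatial filters like this are the main reason to prefer bounding-box output over plain text: the same extraction can feed header detection, region cropping, or citation highlighting without re-parsing the file.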

The interesting bit

The README is unusually candid: it explicitly warns that dense tables, multi-column layouts, charts, and scanned PDFs are likely to be handled better by the vendor’s cloud parser, LlamaParse, than by the local tool, and points you there. That honesty is rarer than it should be in open-source tooling. The architecture is also pleasantly modular: PDFium handles text extraction, LibreOffice and ImageMagick handle format conversion, and the OCR layer is swappable via a simple HTTP API spec.
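
The swappable OCR layer is easiest to picture as a single interface with interchangeable implementations. The Python sketch below shows that general pattern only; OcrWord, bundled_backend, remote_backend, and run_ocr are hypothetical names, both backends are stubs that perform no real OCR, and none of this reflects LiteParse's actual HTTP API spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class OcrWord:
    text: str
    confidence: float


# An OCR backend is modeled as a function from image bytes to recognized words.
OcrBackend = Callable[[bytes], List[OcrWord]]


def bundled_backend(image: bytes) -> List[OcrWord]:
    # Stand-in for a local engine such as Tesseract (no real OCR happens here).
    return [OcrWord(text="local", confidence=0.80)]


def remote_backend(image: bytes) -> List[OcrWord]:
    # Stand-in for an HTTP call to an external OCR server; a real version
    # would send the image and decode the server's response.
    return [OcrWord(text="remote", confidence=0.95)]


BACKENDS: Dict[str, OcrBackend] = {
    "bundled": bundled_backend,
    "http": remote_backend,
}


def run_ocr(image: bytes, backend: str = "bundled") -> List[OcrWord]:
    """Dispatch to the chosen backend; callers do not depend on which one runs."""
    return BACKENDS[backend](image)


print(run_ocr(b"fake-image-bytes")[0].text)
print(run_ocr(b"fake-image-bytes", backend="http")[0].text)
```

Because every backend returns the same word structure, the rest of a pipeline stays unchanged when a heavier OCR service replaces the bundled one.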

Key highlights

Runs fully offline with bundled Tesseract OCR, but accepts plug-in HTTP OCR servers for heavier workloads
Outputs plain text or structured JSON with spatial bounding boxes for every extracted element
Generates high-DPI page screenshots specifically designed for LLM agent consumption
Supports automatic conversion of Word, Excel, PowerPoint, and image files to PDF before parsing
Ships the same lit CLI across Rust, Python, and Node.js installations, plus a WASM build for browsers
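
When bounding boxes and page screenshots are used together, coordinates usually need rescaling. The default PDF unit is 1/72 inch, so a page rendered at a given DPI scales by DPI / 72. The sketch below works through that arithmetic in Python; it assumes boxes are expressed in PDF points as (x, y, width, height) with the same origin as the image, which the source does not confirm for LiteParse, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def points_to_pixels(value_pt: float, dpi: int) -> int:
    """Convert a length in PDF points to pixels at the given render DPI."""
    return round(value_pt * dpi / POINTS_PER_INCH)


def box_to_pixels(box_pt, dpi):
    """Scale an (x, y, width, height) box from points to pixels."""
    return tuple(points_to_pixels(v, dpi) for v in box_pt)


# US Letter is 612 x 792 points (8.5 x 11 inches).
page_px = (points_to_pixels(612, 150), points_to_pixels(792, 150))
print(page_px)  # (1275, 1650)

# Assumed box layout: (x, y, width, height) in points.
print(box_to_pixels((72.0, 60.0, 120.0, 14.0), 150))  # (150, 125, 250, 29)
```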

Caveats

Complex documents—dense tables, handwritten text, scanned PDFs—are explicitly called out as weak spots compared to cloud-based alternatives
Office and image format conversions require external system dependencies (LibreOffice and ImageMagick) that are not bundled
The README is truncated in the provided source, so full performance characteristics and memory usage are unclear
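
Since LibreOffice and ImageMagick are not bundled, a pipeline may want to check for them before attempting Office or image conversion. The Python sketch below looks for commonly used executable names on PATH; how LiteParse itself locates these tools is not stated in the source, so the names checked and the helper functions here are assumptions.

```python
import shutil

# Common executable names; which one exists depends on platform and version.
# Note: on Windows, "convert" may resolve to an unrelated system utility.
CANDIDATES = {
    "LibreOffice": ["soffice", "libreoffice"],
    "ImageMagick": ["magick", "convert"],
}


def find_tool(names):
    """Return the path of the first executable found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_dependencies():
    """Map each tool to its resolved path, or None when it is missing."""
    return {tool: find_tool(names) for tool, names in CANDIDATES.items()}


for tool, path in check_dependencies().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

Running a check like this at startup lets a pipeline skip or flag Office and image inputs early rather than failing partway through a batch.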

Verdict

Developers building local-first document pipelines or LLM agent skills who can tolerate occasional OCR imperfections should look here. If you need guaranteed accuracy on gnarly scanned invoices or multi-column academic papers, the tool itself suggests you look elsewhere.

Frequently asked

What is run-llama/liteparse?: LiteParse gives developers a fast, offline way to extract structured text and bounding boxes from PDFs and Office files, bundling Tesseract OCR and admitting upfront that some documents are simply too messy for local tools.
Is liteparse open source?: Yes — run-llama/liteparse is open source, released under the Apache-2.0 license.
What language is liteparse written in?: run-llama/liteparse is primarily written in Rust.
How popular is liteparse?: run-llama/liteparse has 11.7k stars on GitHub, and its star growth is accelerating (about +46 stars per day over the last 7 days).
Where can I find liteparse?: run-llama/liteparse is on GitHub at https://github.com/run-llama/liteparse.