kreuzberg-dev/kreuzberg

One Rust core, seventeen language bindings, zero patience for PDFs

A document parser that speaks more languages than most developers, extracting text and structure from 90+ formats without GPU hand-holding.

★ 8.5k stars · Rust · Data Tooling · RAG · Search


Velocity (7d): +17 ★ / day · Trend: steady

What it does

Kreuzberg is a Rust-based document extraction engine that pulls text, metadata, images, and structured data from PDFs, Office files, images, archives, and code files across 90+ formats. It runs as a library, CLI, REST API, or MCP server, and ships native bindings for Python, Node, Go, Java, C#, PHP, Ruby, Elixir, R, Dart, Swift, Zig, C, and TypeScript via WASM.

The interesting bit

The polyglot sprawl is the product, not the sideshow. Kreuzberg’s “alef” binding generator keeps seventeen language APIs in sync from a single Rust core, including a WASI build that runs real Tesseract OCR in browsers. The TOON wire format trims token counts by 30–50% for LLM pipelines, which is the kind of boring optimization that actually matters at scale.
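To make the token-saving claim concrete, here is a minimal sketch of why a TOON-style tabular layout comes out shorter than JSON for uniform records: the field names are written once in a header instead of once per row. The `to_toon_table` helper and the `pages` records are hypothetical illustrations, not Kreuzberg's API or output schema, and the encoder handles only flat rows whose values contain no commas or newlines. Character count stands in for token count here, which is only a rough proxy.

```python
import json


def to_toon_table(name, rows):
    """Encode flat dicts that share the same keys as a TOON-style table.

    Simplified sketch: no quoting or escaping, so values must not
    contain commas or newlines.
    """
    fields = list(rows[0].keys())
    # The header declares the row count and the field names exactly once.
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    # Each row then carries values only, in header order.
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


# Hypothetical per-page records, made up for illustration.
pages = [
    {"page": 1, "chars": 1820, "lang": "en"},
    {"page": 2, "chars": 2044, "lang": "en"},
    {"page": 3, "chars": 1377, "lang": "de"},
]

# Compact JSON (no spaces) keeps the comparison fair.
as_json = json.dumps({"pages": pages}, separators=(",", ":"))
as_toon = to_toon_table("pages", pages)

# Fraction of characters saved relative to compact JSON.
saving = 1 - len(as_toon) / len(as_json)

print(as_toon)
print(f"JSON: {len(as_json)} chars, TOON-style: {len(as_toon)} chars")
```

The saving grows with the number of rows, because JSON repeats every key in every record while the tabular form pays for the keys once. Deeply nested or irregular data benefits much less.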

Key highlights

Code intelligence via tree-sitter for 300+ languages, with semantic chunking
OCR backends: Tesseract (all bindings), PaddleOCR (native), EasyOCR (Python), and VLM OCR through 143 providers including local engines
Pure-Rust PDF parsing with SIMD and streaming support for multi-GB files
Plugin architecture for custom extractors, validators, and renderers
Docker images and Helm charts for API/CLI/MCP server deployment
Elastic-2.0 license

Caveats

Docker images are chunky: ~1.0–1.3GB even for the “core” build
WASM build excludes ONNX Runtime features (PaddleOCR, layout detection, embeddings) and server modes
macOS precompiled binaries are Apple Silicon only; Intel Macs need to build from source
Windows support varies: Ruby and Docker lack precompiled binaries, Swift is absent entirely

Verdict

Worth a look if you’re building RAG pipelines, document workflows, or anything that needs to normalize the world’s file formats into clean Markdown. Skip it if you just need to parse the occasional PDF and don’t want a 1GB Docker image sitting around.