One Rust core, seventeen language bindings, zero patience for PDFs
A document parser that speaks more languages than most developers, extracting text and structure from 90+ formats without GPU hand-holding.

What it does
Kreuzberg is a Rust-based document extraction engine that pulls text, metadata, images, and structured data from PDFs, Office files, images, archives, and code files across 90+ formats. It runs as a library, CLI, REST API, or MCP server, and ships native bindings for languages including Python, Node, Go, Java, C#, PHP, Ruby, Elixir, R, Dart, Swift, Zig, and C, plus TypeScript via WASM.
The interesting bit
The polyglot sprawl is the product, not the sideshow. Kreuzberg’s “alef” binding generator keeps seventeen language APIs in sync from a single Rust core, including a WASI build that runs real Tesseract OCR in browsers. The TOON wire format trims token counts by 30–50% for LLM pipelines, which is the kind of boring optimization that actually matters at scale.
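To see why a tabular wire format shrinks payloads, here is a minimal sketch in Python. It is not Kreuzberg's encoder and not a complete TOON implementation: `encode_table` and the sample `pages` data are hypothetical, it skips escaping for values that contain commas, and it compares character counts rather than tokens. The idea it illustrates is that field names are written once in a header instead of once per row.

```python
import json


def encode_table(name, rows):
    """Encode a list of uniform dicts as one header line plus one row per item.

    Hypothetical helper for illustration only; values are not escaped.
    """
    if not rows:
        return name + "[0]:"
    fields = list(rows[0].keys())
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    # Header carries the row count and the field names exactly once.
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = [header]
    for row in rows:
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)


# Made-up per-page metadata, standing in for extraction output.
pages = [
    {"page": 1, "words": 412, "lang": "en"},
    {"page": 2, "words": 388, "lang": "en"},
    {"page": 3, "words": 97, "lang": "de"},
]

as_json = json.dumps({"pages": pages})
as_table = encode_table("pages", pages)

print(as_table)
print(len(as_json), "characters as JSON vs", len(as_table), "as a table")
```

The saving grows with the number of rows, since JSON repeats every key in every object while the table pays for the keys once. Actual token savings depend on the tokenizer and on how uniform the data is.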
Key highlights
- Code intelligence via tree-sitter for 300+ languages, with semantic chunking
- OCR backends: Tesseract (all bindings), PaddleOCR (native), EasyOCR (Python), and VLM OCR through 143 providers including local engines
- Pure-Rust PDF parsing with SIMD and streaming support for multi-GB files
- Plugin architecture for custom extractors, validators, and renderers
- Docker images and Helm charts for API/CLI/MCP server deployment
- Elastic-2.0 license
Caveats
- Docker images are chunky: roughly 1.0–1.3 GB even for the “core” build
- WASM build excludes ONNX Runtime features (PaddleOCR, layout detection, embeddings) and server modes
- macOS precompiled binaries are Apple Silicon only; Intel Macs need to build from source
- Windows support varies: Ruby and Docker lack precompiled binaries, and Swift is absent entirely
Verdict
Worth a look if you’re building RAG pipelines, document workflows, or anything that needs to normalize the world’s file formats into clean Markdown. Skip it if you just need to parse the occasional PDF and don’t want a 1 GB Docker image sitting around.