kreuzberg-dev/kreuzberg

One Rust core, seventeen language bindings, zero patience for PDFs

A document parser that speaks more languages than most developers, extracting text and structure from 90+ formats without GPU hand-holding.

8.5k stars · Rust · Data Tooling · RAG · Search
Velocity (7d): +17 ★ / day · Trend: steady

What it does

Kreuzberg is a Rust-based document extraction engine that pulls text, metadata, images, and structured data from PDFs, Office files, images, archives, and code files across 90+ formats. It runs as a library, CLI, REST API, or MCP server, and ships native bindings for Python, Node, Go, Java, C#, PHP, Ruby, Elixir, R, Dart, Swift, Zig, C, and TypeScript via WASM.

The interesting bit

The polyglot sprawl is the product, not the sideshow. Kreuzberg’s “alef” binding generator keeps seventeen language APIs in sync from a single Rust core, including a WASI build that runs real Tesseract OCR in browsers. The TOON wire format trims token counts by 30–50% relative to JSON for LLM pipelines, which is the kind of boring optimization that actually matters at scale.
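To see where savings of that kind can come from, here is a minimal Python sketch of a TOON-style tabular layout: field names are written once in a header instead of being repeated in every record. This is a simplified illustration, not the TOON specification and not Kreuzberg's encoder. The `to_tabular` helper and the `pages` records are hypothetical, and character counts are only a rough proxy for token counts.

```python
import json


def to_tabular(name, rows):
    """Encode a uniform list of flat dicts as one header line plus one line per row.

    Simplified sketch: values are written with str() and are not quoted or
    escaped, so values containing commas or newlines are not handled.
    """
    fields = list(rows[0].keys())
    # Every row must share the same keys, otherwise the single header would be wrong.
    if any(list(r.keys()) != fields for r in rows):
        raise ValueError("rows must share the same keys in the same order")
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    body = ["  " + ",".join(str(r[f]) for f in fields) for r in rows]
    return "\n".join([header] + body)


# Hypothetical extraction output: one record per page.
pages = [
    {"page": 1, "words": 412, "lang": "en"},
    {"page": 2, "words": 388, "lang": "en"},
    {"page": 3, "words": 405, "lang": "de"},
]

as_json = json.dumps({"pages": pages})
as_table = to_tabular("pages", pages)

print(as_table)
print(len(as_json), "chars as JSON vs", len(as_table), "chars as table")
```

The gap widens as the number of records grows, because JSON repeats every key per record while the tabular form pays for the keys once.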

Key highlights

  • Code intelligence via tree-sitter for 300+ languages, with semantic chunking
  • OCR backends: Tesseract (all bindings), PaddleOCR (native), EasyOCR (Python), and VLM OCR through 143 providers including local engines
  • Pure-Rust PDF parsing with SIMD and streaming support for multi-GB files
  • Plugin architecture for custom extractors, validators, and renderers
  • Docker images and Helm charts for API/CLI/MCP server deployment
  • Elastic-2.0 license
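
As a rough illustration of what syntax-aware ("semantic") chunking means in the first highlight, the sketch below splits source code at top-level definition boundaries rather than at fixed character offsets. It uses Python's standard `ast` module as a stand-in for tree-sitter and handles Python source only. The `chunk_by_definition` function is hypothetical and says nothing about how Kreuzberg implements chunking.

```python
import ast


def chunk_by_definition(source):
    """Split Python source into chunks, one per top-level statement or definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        name = getattr(node, "name", None)  # present on functions and classes
        chunks.append({"kind": type(node).__name__, "name": name, "text": text})
    return chunks


sample = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Parser:
    def parse(self, text):
        return text.split()
'''

for chunk in chunk_by_definition(sample):
    print(chunk["kind"], chunk["name"], len(chunk["text"].splitlines()), "lines")
```

Each chunk is a complete syntactic unit, so a function body is never cut in half before it reaches an embedding model. This simplified version drops comments between definitions and decorator lines above a function; a production chunker would need to keep them.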

Caveats

  • Docker images are chunky: ~1.0–1.3GB even for the “core” build
  • WASM build excludes ONNX Runtime features (PaddleOCR, layout detection, embeddings) and server modes
  • macOS precompiled binaries are Apple Silicon only; Intel Macs need to build from source
  • Windows support varies: Ruby and Docker lack precompiled binaries, Swift is absent entirely

Verdict

Worth a look if you’re building RAG pipelines, document workflows, or anything that needs to normalize the world’s file formats into clean Markdown. Skip it if you just need to parse the occasional PDF and don’t want a 1GB Docker image sitting around.
