Python-free OCR stack that speaks OpenAI
A Rust rewrite of DeepSeek-OCR with three vision backends, DSQ quantization, and a drop-in HTTP server—no conda required.

What it does
This is a Rust workspace that runs document OCR and visual-language inference locally using Candle, with three model backends to choose from. You get a CLI for batch jobs and an HTTP server that exposes /v1/chat/completions and /v1/responses, so OpenAI SDKs and tools like Open WebUI connect without adapters. It downloads weights automatically from Hugging Face or ModelScope on first run.
The interesting bit
The project is essentially a Rust-native port of a Python + Transformers pipeline, but the rewrite buys you more than just memory safety. Prompt token construction runs ~97× faster than the reference Python stack, and the server automatically collapses chat history to a single turn so OCR outputs stay deterministic even when chatty clients send full conversation context.
Key highlights
- Three backends with clear trade-offs: DeepSeek-OCR (~13GB RAM, highest accuracy), PaddleOCR-VL (~9GB, lighter and faster), and DotsOCR (30–50GB for high-res layout/reading-order tasks).
- DSQ-quantized variants (Q4_K through Q8_0) for each backend to shrink weight memory.
- Apple Metal and x86 MKL support are first-class; NVIDIA CUDA is available but marked alpha.
- Shared
config.tomlkeeps CLI and server in sync; runtime overrides resolve cleanly from flags → config → defaults → request payload. - Pre-built macOS (Metal) and Windows binaries ship via GitHub Actions artifacts.
Caveats
- CUDA support is explicitly alpha: “expect rough edges while we finish kernel coverage.”
- DotsOCR’s vision tower is heavy; the README warns it can fall to ~12 tok/s on CPU and demands 30–50GB RAM/VRAM for high-resolution documents.
- Debug builds are “extremely slow”; you must compile
--releasefor usable throughput.
Verdict
Worth a look if you want local OCR without dragging in Python, conda, or the GIL—especially on Apple Silicon. Skip it if you need production-grade CUDA today or if your hardware can’t stomach the larger models’ memory appetite.