OCR that actually reads your weird old books
A turn-key OCR engine built for historical manuscripts, non-Latin scripts, and the messy reality of digitization.

What it does
kraken is a Python-based OCR system that handles the full pipeline: binarization, page segmentation, layout analysis, and text recognition. It outputs to standard formats like ALTO, PageXML, and hOCR. You install via pip/pipx, download a model from its public Zenodo repository, and run commands like kraken -i image.tif image.txt binarize segment ocr.
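To make the shape of that command concrete, here is a minimal Python sketch that assembles the same argument list for use with subprocess. The helper name build_kraken_command is hypothetical (it is not part of kraken), and the optional model argument assumes the ocr subcommand's -m option for pointing at a downloaded model file; check kraken's CLI help before relying on it.

```python
import shlex
from typing import List, Optional, Sequence


def build_kraken_command(
    image: str,
    output: str,
    steps: Sequence[str] = ("binarize", "segment", "ocr"),
    model: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for a chained kraken run (hypothetical helper).

    Mirrors the command quoted above:
        kraken -i image.tif image.txt binarize segment ocr
    """
    argv = ["kraken", "-i", image, output]
    for step in steps:
        argv.append(step)
        # Assumption: a model file is passed to the ocr step with -m.
        if step == "ocr" and model is not None:
            argv.extend(["-m", model])
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so this works without kraken installed.
    print(shlex.join(build_kraken_command("image.tif", "image.txt")))
```

Passing the resulting list to subprocess.run(argv, check=True) would execute the pipeline; building a list rather than a shell string avoids quoting problems with file names that contain spaces.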
The interesting bit
Most OCR tools assume clean, modern, left-to-right Latin text. kraken starts from the opposite assumption: historical documents, right-to-left and top-to-bottom scripts, and mixed layouts are first-class citizens. The baseline segmenter and trainable reading-order detection suggest it was built by people who actually work with manuscripts.
Key highlights
- Supports RTL, BiDi, and top-to-bottom scripts natively
- Fully trainable layout analysis, reading order, and character recognition
- Public model repository on Zenodo; kraken list shows available models
- Outputs ALTO, PageXML, abbyyXML, and hOCR
- Tight integration with eScriptorium for GUI-based annotation and training
- Variable recognition network architecture (unclear from docs how “variable” this is in practice)
Caveats
- No Windows support; Linux and macOS only
- PDF and multi-image TIFF/JPEG2000 require the extra pip install kraken[pdf]
- Python version lock: 3.10–3.13 only, and pipx may default to something incompatible
- You must manually fetch models before doing anything useful; no built-in defaults
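The platform and Python-version caveats above are easy to check before installing. The following is a rough preflight sketch, not anything kraken ships: the function names and constants are made up for illustration, and the version bounds are simply the 3.10–3.13 range quoted above.

```python
import sys
from typing import Optional, Tuple

# Bounds taken from the caveat above; adjust if kraken's requirements change.
SUPPORTED_MIN: Tuple[int, int] = (3, 10)
SUPPORTED_MAX: Tuple[int, int] = (3, 13)


def python_supported(version: Optional[Tuple[int, ...]] = None) -> bool:
    """Return True if the (major, minor) version falls inside the supported range."""
    if version is None:
        version = tuple(sys.version_info)
    major_minor = tuple(version[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def platform_supported(platform_name: Optional[str] = None) -> bool:
    """Return True for Linux and macOS, the two platforms listed above."""
    if platform_name is None:
        platform_name = sys.platform
    return platform_name.startswith("linux") or platform_name == "darwin"


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("Platform OK:", platform_supported())
```

Running this with the interpreter that pipx would pick up is a quick way to find out whether it has defaulted to an incompatible Python.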
Verdict
Digital humanists, archivists, and anyone wrangling pre-modern or non-Latin texts should look here first. If your use case is “scan a modern English invoice,” mainstream tools will probably involve less friction.