mittagessen/kraken

OCR that actually reads your weird old books

A turn-key OCR engine built for historical manuscripts, non-Latin scripts, and the messy reality of digitization.

Velocity (7d): +0.2 ★/day · Trend: steady

What it does

kraken is a Python-based OCR system that covers the full pipeline: binarization, page segmentation and layout analysis, and text recognition. It writes standard formats such as ALTO, PageXML, and hOCR. You install it with pip or pipx, download a model from the project's public Zenodo repository, and run a command such as kraken -i image.tif image.txt binarize segment ocr.
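
For batch jobs, the same command line can be assembled from Python and handed to subprocess. What follows is a minimal sketch, not kraken's own API: the helper names build_kraken_command and run_kraken are hypothetical, and the only thing taken from kraken is the command shape quoted above. It assumes the kraken executable is on PATH and that a model has already been fetched.

```python
import shutil
import subprocess


def build_kraken_command(image_path, output_path):
    """Assemble the argument list for the pipeline quoted above.

    Hypothetical helper: it only mirrors
    `kraken -i image.tif image.txt binarize segment ocr`.
    """
    return [
        "kraken",
        "-i", str(image_path), str(output_path),
        "binarize", "segment", "ocr",
    ]


def run_kraken(image_path, output_path):
    """Run the pipeline if the `kraken` executable is on PATH.

    Returns the process exit code, or None when kraken is not installed.
    """
    if shutil.which("kraken") is None:
        return None
    cmd = build_kraken_command(image_path, output_path)
    # check=False: report the exit code instead of raising on failure.
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Show the command that would be executed for one page image.
    print(" ".join(build_kraken_command("image.tif", "image.txt")))
```

Keeping the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces, which are common in digitization folders.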

The interesting bit

Most OCR tools assume clean, modern, left-to-right Latin text. kraken starts from the opposite assumption: historical documents, right-to-left and top-to-bottom scripts, and mixed layouts are first-class citizens. The baseline segmenter and trainable reading-order detection suggest it was built by people who actually work with manuscripts.

Key highlights

  • Supports RTL, BiDi, and top-to-bottom scripts natively
  • Fully trainable layout analysis, reading order, and character recognition
  • Public model repository on Zenodo; kraken list shows available models
  • Outputs ALTO, PageXML, abbyyXML, and hOCR
  • Tight integration with eScriptorium for GUI-based annotation and training
  • Variable recognition network architecture, described through a VGSL network specification (how much variation is practical is unclear from the docs)

Caveats

  • No Windows support; Linux and macOS only
  • PDF and multi-image TIFF/JPEG2000 input require the pdf extra: pip install kraken[pdf]
  • Python version lock: 3.10–3.13 only, and pipx may default to something incompatible
  • You must manually fetch models before doing anything useful; no built-in defaults
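
The Python version caveat is easy to check before installing. The sketch below assumes the 3.10–3.13 range stated above is inclusive on both ends; the function name is_supported_python and the two constants are hypothetical, introduced only for illustration.

```python
import sys

# Supported range as stated in the caveat above (assumed inclusive).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 13)


def is_supported_python(version=None):
    """Return True if the (major, minor) version falls within 3.10-3.13.

    `version` may be any tuple-like value such as sys.version_info;
    it defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    if is_supported_python():
        print("Python " + current + " is within the supported range")
    else:
        print("Python " + current + " is outside 3.10-3.13")
```

If pipx selects an incompatible interpreter, its --python option can point the install at a specific one.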

Verdict

Digital humanists, archivists, and anyone wrangling pre-modern or non-Latin texts should look here first. If your use case is “scan a modern English invoice,” mainstream tools will probably involve less friction.
