mittagessen/kraken

OCR that actually reads your weird old books

A turn-key OCR engine built for historical manuscripts, non-Latin scripts, and the messy reality of digitization.

★ 1k stars | Python | Computer Vision | Inference · Serving

What it does

kraken is a Python-based OCR system that handles the full pipeline: binarization, page segmentation, layout analysis, and text recognition. It writes standard formats such as ALTO, PageXML, and hOCR. To get started, install it with pip or pipx, download a model from the project's public Zenodo repository, and run a command such as: kraken -i image.tif image.txt binarize segment ocr
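
For batch jobs it can help to assemble that command line from a script. The sketch below builds the same invocation quoted above, using only the Python standard library. The helper names build_kraken_command and run_kraken are hypothetical, not part of kraken. Model selection is not shown, and the recognition step may need an explicit model option depending on your setup.

```python
import shutil
import subprocess
from pathlib import Path


def build_kraken_command(image, output, steps=("binarize", "segment", "ocr")):
    """Build the argument list for a single-image kraken run.

    Mirrors the command quoted in the text:
    kraken -i image.tif image.txt binarize segment ocr
    (hypothetical helper, not a kraken API).
    """
    return ["kraken", "-i", str(image), str(output), *steps]


def run_kraken(image, output):
    """Run the pipeline, failing early if the kraken executable is missing."""
    if shutil.which("kraken") is None:
        raise RuntimeError("kraken executable not found; install it with pip or pipx first")
    # check=True raises CalledProcessError if kraken exits with a non-zero status.
    subprocess.run(build_kraken_command(image, output), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_kraken() to actually execute it.
    print(" ".join(build_kraken_command(Path("image.tif"), Path("image.txt"))))
```

Building the arguments as a list instead of a single shell string avoids quoting problems with file names that contain spaces, which are common in digitization archives.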

The interesting bit

Most OCR tools assume clean, modern, left-to-right Latin text. kraken starts from the opposite assumption: historical documents, right-to-left and top-to-bottom scripts, and mixed layouts are first-class citizens. The baseline segmenter and trainable reading-order detection suggest it was built by people who actually work with manuscripts.

Key highlights

Supports RTL, BiDi, and top-to-bottom scripts natively
Fully trainable layout analysis, reading order, and character recognition
Public model repository on Zenodo; kraken list shows available models
Outputs ALTO, PageXML, abbyyXML, and hOCR
Tight integration with eScriptorium for GUI-based annotation and training
Variable recognition network architecture (unclear from docs how “variable” this is in practice)

Caveats

No Windows support; Linux and macOS only
PDF, multi-image TIFF, and JPEG2000 input requires the pdf extra: pip install kraken[pdf]
Python 3.10–3.13 only; pipx may pick an incompatible interpreter by default
You must manually fetch models before doing anything useful; no built-in defaults
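
The platform and Python-version caveats above can be checked before installing. This is a minimal preflight sketch that takes the limits exactly as listed here (Linux or macOS, Python 3.10 through 3.13). The function name environment_problems and the constants are hypothetical, and the limits should be re-checked against the current kraken release.

```python
import platform
import sys

# Limits taken from the caveats listed above; they are assumptions of this sketch.
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}  # platform.system() reports "Darwin" on macOS
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)


def environment_problems(version=None, system=None):
    """Return a list of human-readable problems; an empty list means OK.

    version: (major, minor) tuple, defaults to the running interpreter.
    system:  name as returned by platform.system(), defaults to this machine.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    system = platform.system() if system is None else system

    problems = []
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"unsupported operating system: {system}")
    if not (MIN_PYTHON <= version <= MAX_PYTHON):
        problems.append(f"unsupported Python version: {version[0]}.{version[1]}")
    return problems


if __name__ == "__main__":
    issues = environment_problems()
    print("environment looks compatible" if not issues else "\n".join(issues))
```

Running this with the interpreter that pipx would use is a quick way to catch the incompatible-default case before an install fails.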

Verdict

Digital humanists, archivists, and anyone wrangling pre-modern or non-Latin texts should look here first. If your use case is “scan a modern English invoice,” mainstream tools will probably involve less friction.

Frequently asked

What is mittagessen/kraken?: A turn-key OCR engine built for historical manuscripts, non-Latin scripts, and the messy reality of digitization.
Is kraken open source?: Yes — mittagessen/kraken is open source, released under the Apache-2.0 license.
What language is kraken written in?: mittagessen/kraken is primarily written in Python.
How popular is kraken?: mittagessen/kraken has 1k stars on GitHub.
Where can I find kraken?: mittagessen/kraken is on GitHub at https://github.com/mittagessen/kraken.