The 1985 OCR engine that outlived its creators
HP's abandoned text-recognition project became the open-source default for turning images into words.

What it does
Tesseract extracts text from images. It ships as both a C/C++ library (libtesseract) and a command-line tool that reads PNG, JPEG, or TIFF and spits out plain text, PDF, hOCR, or structured formats like ALTO and PAGE. It claims support for 100+ languages out of the box, provided you download the right trained data files.
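As a rough illustration of the command-line route, here is a minimal Python sketch that shells out to the tesseract binary and captures the recognized text. It assumes the executable is on your PATH and the requested language data is installed; the function names (stdout_command, image_to_text) are made up for this example and are not part of Tesseract.

```python
import shutil
import subprocess
from typing import List


def stdout_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build a tesseract invocation that prints recognized text to stdout.

    Using "stdout" as the output base makes the CLI print the text
    instead of writing <outputbase>.txt.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def image_to_text(image_path: str, lang: str = "eng") -> str:
    """Run the CLI and return its text output (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        stdout_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # Only builds and prints the command; call image_to_text() to run OCR.
    print(" ".join(stdout_command("scan.png")))
```

Languages can be combined with a plus sign, for example lang="eng+deu", as long as both traineddata files are present.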
The interesting bit
This thing is old enough to rent a car in the US. It was born at HP Labs in 1985, open-sourced in 2005, shepherded by Google from 2006 to 2018, and is now on version 5 under community maintenance. In version 4 the engine switched from matching character patterns to LSTM neural-net line recognition, yet it still carries the legacy engine for backward compatibility. Archaeology and engineering in one repo.
Key highlights
- Dual-engine architecture: LSTM neural net (default) or legacy Tesseract 3 mode via --oem 0
- 100+ languages supported with downloadable traineddata files
- Multiple output formats: text, hOCR, PDF (including invisible-text-only), TSV, ALTO, PAGE
- C and C++ APIs; third-party wrappers for other languages exist
- Apache 2.0 licensed; depends on Leptonica for image I/O
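To make the engine switch and the output formats concrete, the following sketch assembles a command line that selects an engine mode with --oem and requests one or more outputs by naming the corresponding config files. The helper build_command and the EngineMode enum are hypothetical names for this example. Two assumptions are worth checking on your install: the legacy engine needs traineddata files that still contain the legacy model, and the page output needs a recent 5.x release.

```python
from enum import IntEnum
from typing import List, Sequence


class EngineMode(IntEnum):
    """Values accepted by the CLI's --oem option."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


# Output config names matching the formats listed above.
KNOWN_OUTPUTS = ("txt", "hocr", "pdf", "tsv", "alto", "page")


def build_command(
    image_path: str,
    output_base: str,
    outputs: Sequence[str] = ("txt",),
    oem: EngineMode = EngineMode.DEFAULT,
    lang: str = "eng",
    text_only_pdf: bool = False,
) -> List[str]:
    """Return the argv list for one tesseract run (hypothetical helper)."""
    unknown = [o for o in outputs if o not in KNOWN_OUTPUTS]
    if unknown:
        raise ValueError(f"unknown output format(s): {unknown}")
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--oem", str(int(oem))]
    if text_only_pdf:
        if "pdf" not in outputs:
            raise ValueError("text_only_pdf requires the 'pdf' output")
        # Produces a PDF that contains only the invisible text layer.
        cmd += ["-c", "textonly_pdf=1"]
    # Config file names go last; each one adds an output file.
    cmd += list(outputs)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("scan.png", "out", outputs=("pdf", "hocr"))))
```

Keeping command construction separate from execution makes it easy to log or test the exact invocation before handing it to subprocess.run.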
Caveats
- No GUI included; you’ll need a third-party front end or to build your own
- Image quality matters significantly; the docs explicitly warn you’ll need to preprocess for best results
- Building from source requires a supported compiler; check yours against the project’s list of supported compilers before you start
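Preprocessing covers several steps, and binarization is one common example. As an illustration only, the sketch below applies Otsu's method to a flat list of 0-255 grayscale values in pure Python. The function names are hypothetical, a real pipeline would more likely use an image library, and Tesseract also binarizes internally, so treat this as a picture of the idea rather than a required step.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is background; nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: List[int]) -> List[int]:
    """Map each pixel to 0 or 255 using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10, 12, 11, 200, 205, 198]  # dark ink values, then light paper
    print(binarize(sample))
```

A single global threshold like this struggles with unevenly lit pages, which is one reason the caveat about image quality is worth taking seriously.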
Verdict
Use it if you need battle-tested OCR without vendor lock-in or cloud bills. Skip it if you want a polished one-click app or if your images are already clean enough for a simpler tool.