tesseract-ocr/tesseract

The 1985 OCR engine that outlived its creators

HP's abandoned text-recognition project became the open-source default for turning images into words.

74.6k stars C++ Computer Vision
Velocity (7d): +17 ★ / day · Trend: steady

What it does

Tesseract extracts text from images. It ships as both a C/C++ library (libtesseract) and a command-line tool that reads PNG, JPEG, or TIFF and spits out plain text, PDF, hOCR, TSV, or structured formats like ALTO and PAGE. The project advertises 100+ languages "out of the box", though in practice each language needs its traineddata file installed.
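
To make the command-line shape concrete, here is a minimal sketch that assembles a tesseract invocation without running it. The helper name `build_tesseract_command` and the file names are hypothetical; the argument order (image, output base, `-l` language, then output config names) follows the CLI's usual form, and the allow-list is deliberately limited to a handful of output names.

```python
import shlex

# Output config names this sketch allows; "txt" is what the CLI produces by default.
KNOWN_OUTPUTS = {"txt", "pdf", "hocr", "tsv", "alto"}


def build_tesseract_command(image, output_base, lang="eng", outputs=("txt",)):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    unknown = set(outputs) - KNOWN_OUTPUTS
    if unknown:
        raise ValueError(f"unsupported output config(s): {sorted(unknown)}")
    # CLI shape: tesseract IMAGE OUTPUTBASE -l LANG [CONFIG ...]
    return ["tesseract", str(image), str(output_base), "-l", lang, *outputs]


if __name__ == "__main__":
    # Two languages are joined with "+"; each output gets its own file extension.
    cmd = build_tesseract_command("scan.png", "scan", lang="eng+deu", outputs=("txt", "hocr"))
    print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming the tesseract binary and the requested traineddata files are installed.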

The interesting bit

This thing is old enough to rent a car in the US. Development started at HP in 1985, HP open-sourced it in 2005, Google developed it from 2006 until late 2018, and it is now on version 5 under community maintenance. The engine switched from pattern-matching characters to LSTM neural-net line recognition in version 4, yet still carries the legacy engine for backward compatibility. Archaeology and engineering in one repo.
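
The two engines are selected with the `--oem` flag. A small sketch of how a wrapper script might name the modes; the `EngineMode` enum and `engine_args` helper are illustrative names, while the numeric values mirror what `tesseract --help-oem` lists.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """OCR engine modes accepted by --oem (names here are illustrative)."""
    LEGACY = 0           # Tesseract 3 character-pattern engine
    LSTM = 1             # neural-net line recognizer
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # chosen from what the installed traineddata offers


def engine_args(mode):
    """Return the CLI arguments that select the given engine mode."""
    return ["--oem", str(int(mode))]


if __name__ == "__main__":
    print(engine_args(EngineMode.LEGACY))
```

Note that the legacy mode only works with traineddata files that still include the legacy model, so switching the flag alone may not be enough.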

Key highlights

  • Dual-engine architecture: LSTM neural net (default) or legacy Tesseract 3 mode via --oem 0 (which needs traineddata files that include the legacy model)
  • 100+ languages supported with downloadable traineddata files
  • Multiple output formats: text, hOCR, PDF (including invisible-text-only), TSV, ALTO, PAGE
  • C and C++ APIs; third-party wrappers for other languages exist
  • Apache 2.0 licensed; depends on Leptonica for image I/O
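
Of the output formats listed above, TSV is the easiest to post-process. The sketch below filters word-level rows by confidence. It assumes the standard TSV header (`level`, `conf`, `text` among the columns, with level 5 marking words); the `confident_words` helper, the sample rows, and the threshold of 60 are made up for illustration.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (word, confidence) pairs for word rows at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; other levels describe layout boxes.
        if row.get("level") != "5" or not text:
            continue
        conf = float(row["conf"])
        if conf >= min_conf:
            words.append((text, conf))
    return words


# Hand-written sample in the TSV layout, not real OCR output.
SAMPLE = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tHello",
    "5\t1\t1\t1\t1\t2\t109\t92\t75\t24\t41.0\tw0rld",
])

if __name__ == "__main__":
    print(confident_words(SAMPLE))
```

A low-confidence word is often a sign that the source image needs cleanup rather than that the text is unreadable.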

Caveats

  • No GUI included; you’ll need to use a third-party front end or build your own
  • Image quality matters significantly; the docs explicitly warn you’ll need to preprocess for best results
  • Building from source means checking your compiler against the project's list of supported compilers

Verdict

Use it if you need battle-tested OCR without vendor lock-in or cloud bills. Skip it if you want a polished one-click app or if your images are already clean enough for a simpler tool.
