Is OCRmyPDF open source?

Yes — ocrmypdf/OCRmyPDF is open source, released under the MPL-2.0 license.

What language is OCRmyPDF written in?

ocrmypdf/OCRmyPDF is primarily written in Python.

How popular is OCRmyPDF?

ocrmypdf/OCRmyPDF has 34.3k stars on GitHub and is currently holding steady.

Where can I find OCRmyPDF?

ocrmypdf/OCRmyPDF is on GitHub at https://github.com/ocrmypdf/OCRmyPDF.


ocrmypdf/OCRmyPDF

PDF OCR that respects your images and your clipboard

OCRmyPDF exists because most free OCR tools botch text placement, bloat file sizes, or mangle image resolution when trying to make scanned documents searchable.

★ 34.3k stars · Python · Computer Vision


Velocity (7d): +11 ★ / day · Trend: steady

What it does

OCRmyPDF is a command-line utility that adds a hidden text layer to scanned PDFs and images, producing validated PDF/A files that you can search and copy text from. It wraps Tesseract OCR and aims to leave the original visual content untouched: it preserves image resolution and, when possible, inserts text without disrupting existing PDF structure. It also optimizes images during processing, often yielding files smaller than the input.
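As a rough illustration of what a scripted invocation might look like, the sketch below assembles an `ocrmypdf` command line in Python. The helper name `build_ocr_command` and the file names are hypothetical; the flags used (`--output-type`, `-l`, `--deskew`, `--rotate-pages`) are standard OCRmyPDF options, but check them against the version you have installed.

```python
def build_ocr_command(src, dst, languages=("eng",), deskew=False, rotate_pages=False):
    """Return an argv list for the ocrmypdf CLI (hypothetical helper, not part of OCRmyPDF)."""
    # PDF/A is already the default output type; it is spelled out here for clarity.
    cmd = ["ocrmypdf", "--output-type", "pdfa"]
    # Tesseract language codes are joined with '+', e.g. "eng+deu".
    cmd += ["-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")        # straighten slightly crooked scans
    if rotate_pages:
        cmd.append("--rotate-pages")  # fix pages scanned sideways or upside down
    cmd += [str(src), str(dst)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without OCRmyPDF installed.
    print(" ".join(build_ocr_command("scan.pdf", "searchable.pdf", deskew=True)))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming `ocrmypdf` is on the `PATH`.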

The interesting bit

The author built this after finding that existing free tools misplaced text (breaking copy-paste), bloated files, or crashed entirely. The tool treats OCR as a surgical insertion rather than a document rebuild: it keeps your original embedded images at exact resolution, aligns recognized text accurately beneath them, and can even deskew or rotate pages before processing. A plugin interface lets you swap Tesseract for Apple Vision, EasyOCR, or PaddleOCR if you prefer a different engine.

Key highlights

Produces validated PDF/A output by default, targeting long-term archival compliance
Attempts lossless text insertion without altering existing PDF content or image resolution
Supports 100+ languages via Tesseract, with automatic parallelization across CPU cores
Optimizes images during processing and often outputs smaller files than it received
Scales to thousands of pages; the README claims it is “battle-tested on millions of PDFs”

Caveats

Requires external system binaries: Ghostscript and Tesseract must be installed separately
Explicitly a scriptable command-line program, not a GUI tool
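Because the external binaries must be installed separately, a script might check for them before starting a long job. This is a minimal sketch: the helper `missing_binaries` is hypothetical, and the executable names `tesseract` and `gs` are assumptions that hold on typical Linux and macOS installs (Ghostscript is named differently on Windows).

```python
import shutil


def missing_binaries(required=("tesseract", "gs")):
    """Return the names in `required` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("Tesseract and Ghostscript were found on PATH.")
```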

Verdict

Worth a look if you manage document archives or need to automate OCR in a pipeline. Skip it if you need a visual point-and-click tool or if your PDFs are already searchable.
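For the pipeline case, one possible shape is a small planner that produces one command per PDF. The sketch below is an assumption-laden illustration: `plan_batch`, the file names, and the output directory are hypothetical. `--skip-text` asks OCRmyPDF to leave pages that already contain text alone, and `--jobs` caps the number of parallel workers.

```python
from pathlib import Path


def plan_batch(pdf_paths, output_dir, jobs=2):
    """Return one ocrmypdf argv list per PDF in pdf_paths (hypothetical helper)."""
    out_dir = Path(output_dir)
    commands = []
    for src in map(Path, pdf_paths):
        if src.suffix.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        commands.append([
            "ocrmypdf",
            "--skip-text",       # do not re-OCR pages that already have a text layer
            "--jobs", str(jobs),
            str(src),
            str(out_dir / src.name),
        ])
    return commands


if __name__ == "__main__":
    for cmd in plan_batch(["invoice.pdf", "notes.txt", "contract.PDF"], "searchable"):
        print(" ".join(cmd))
```

Returning the commands rather than running them keeps the planning step easy to inspect and test; each list can then be handed to `subprocess.run`.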
