The PDF OCR tool that actually preserves your scans
A command-line utility that adds searchable text layers to scanned PDFs without mangling image quality or bloating file sizes.
What it does
OCRmyPDF takes scanned PDFs (or images) and adds a hidden text layer underneath each page image, making the pages searchable and copy-pasteable. It outputs validated PDF/A by default, spreads work across all CPU cores, and handles 100+ languages via Tesseract. It is written in Python and packaged for most platforms: apt, dnf, brew, ports, snap, nix, and Docker images for x64 and ARM.
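As a rough illustration of a basic run, here is a minimal Python sketch that assembles an `ocrmypdf` command line and hands it to `subprocess`. The helper names `build_ocr_command` and `run_ocr` and the file names are hypothetical; `--language` and `--jobs` are OCRmyPDF command-line options, and the sketch assumes `ocrmypdf` is on your PATH.

```python
import subprocess
from pathlib import Path


def build_ocr_command(src, dst, languages=("eng",), jobs=None):
    """Return the argument list for a basic OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    # Tesseract language codes are joined with "+", e.g. "eng+deu".
    cmd += ["--language", "+".join(languages)]
    if jobs is not None:
        # Optionally cap the number of parallel workers.
        cmd += ["--jobs", str(jobs)]
    cmd += [str(Path(src)), str(Path(dst))]
    return cmd


def run_ocr(src, dst, **options):
    """Run OCRmyPDF and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_ocr_command(src, dst, **options), check=True)
```

Calling `run_ocr("scan.pdf", "scan_ocr.pdf", languages=("eng", "deu"))` would then OCR a placeholder file `scan.pdf` in English and German. Keeping the command builder separate from the runner makes it easy to log or inspect the exact command before anything executes.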
The interesting bit
The author built this out of frustration with existing tools that broke copy/paste, mangled image resolution, produced comically oversized files, or crashed outright. The “lossless” insertion approach (adding OCR text without touching other content) sounds like table stakes, but apparently wasn’t. The plugin architecture is a nice touch: swap in Apple Vision on macOS, EasyOCR, or PaddleOCR if Tesseract isn’t your thing.
Key highlights
- Preserves original image resolution; often compresses output smaller than input
- Deskews, rotates, and cleans pages on request before OCR
- Validates both input and output files
- Scales to thousands of pages; the project describes itself as “battle-tested on millions of PDFs”
- MPL-2.0 license: usable in commercial projects, but if you distribute modified OCRmyPDF source files, those changes must be released under the same license
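The optional clean-up steps in the list above map onto command-line flags. The following is a sketch only: `preprocessing_flags` and `batch_commands` are hypothetical helpers, the input and output paths are placeholders, and the code only builds command lists rather than running them. The flags `--deskew`, `--rotate-pages`, and `--clean` are OCRmyPDF options; `--clean` additionally relies on the external `unpaper` tool.

```python
from pathlib import Path


def preprocessing_flags(deskew=False, rotate=False, clean=False):
    """Map the optional clean-up steps to OCRmyPDF command-line flags."""
    flags = []
    if deskew:
        flags.append("--deskew")        # straighten slightly crooked scans
    if rotate:
        flags.append("--rotate-pages")  # fix pages scanned sideways or upside down
    if clean:
        flags.append("--clean")         # clean pages before OCR; needs unpaper installed
    return flags


def batch_commands(pdf_paths, out_dir, **steps):
    """Build one ocrmypdf command per input, writing results into out_dir."""
    flags = preprocessing_flags(**steps)
    commands = []
    for src in map(Path, pdf_paths):
        dst = Path(out_dir) / src.name  # keep the original file name
        commands.append(["ocrmypdf", *flags, str(src), str(dst)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Writing results to a separate output directory leaves the original scans untouched.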
Caveats
- Requires external dependencies: Ghostscript and Tesseract must be installed separately
- GPU “strongly recommended” for the EasyOCR and PaddleOCR plugins
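Because the external tools are a hard requirement, a pipeline may want to fail early when they are missing. Below is a minimal sketch of such a pre-flight check. The helper `missing_dependencies` is hypothetical, and the executable names are assumptions: `tesseract` for Tesseract, and `gs` (or `gswin64c` on Windows) for Ghostscript.

```python
import shutil

# Assumed executable names per tool; adjust for your platform.
REQUIRED_TOOLS = {
    "Tesseract": ("tesseract",),
    "Ghostscript": ("gs", "gswin64c"),
}


def missing_dependencies(which=shutil.which):
    """Return the names of required tools with no executable found on PATH."""
    missing = []
    for name, candidates in REQUIRED_TOOLS.items():
        # A tool counts as present if any of its candidate executables resolves.
        if not any(which(exe) for exe in candidates):
            missing.append(name)
    return missing
```

A script could call `missing_dependencies()` before starting a batch and stop with a clear message if the returned list is not empty. The `which` parameter exists only so the lookup can be replaced in tests.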
Verdict
If you’re digitizing paper archives, building a document pipeline, or just tired of unsearchable scans, this is a solid, well-maintained workhorse. Skip it if you need a GUI or real-time OCR; this is strictly batch/command-line.