ocrmypdf/OCRmyPDF

The PDF OCR tool that actually preserves your scans

A command-line utility that adds searchable text layers to scanned PDFs without mangling image quality or bloating file sizes.

33.8k stars · Python · Computer Vision

What it does

OCRmyPDF takes scanned PDFs (or images) and adds a hidden text layer underneath each page image, making them searchable and copy-pasteable. It outputs validated PDF/A by default, spreads work across all CPU cores, and handles 100+ languages via Tesseract. OCRmyPDF itself is pure Python and packaged for basically every package manager (apt, dnf, brew, ports, snap, nix), with Docker images for x64 and ARM as well.
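
To make the description above concrete, here is a minimal sketch of driving the command-line tool from Python. The helper names `build_ocr_command` and `run_ocr` and the file names are hypothetical; the flags used (`-l`, `--output-type`, `--jobs`) are ones the `ocrmypdf` CLI accepts, but check `ocrmypdf --help` for your installed version.

```python
import subprocess
from typing import List, Optional, Sequence


def build_ocr_command(
    src: str,
    dst: str,
    languages: Sequence[str] = ("eng",),
    jobs: Optional[int] = None,
) -> List[str]:
    """Build an argument list for the ocrmypdf command-line tool (hypothetical helper)."""
    # Several Tesseract language codes are joined with "+", e.g. "eng+deu".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    # PDF/A is already the default output type; it is spelled out here for clarity.
    cmd += ["--output-type", "pdfa"]
    if jobs is not None:
        # Optionally cap the number of worker processes.
        cmd += ["--jobs", str(jobs)]
    cmd += [src, dst]
    return cmd


def run_ocr(src: str, dst: str, **options) -> None:
    """Run ocrmypdf; assumes it is installed and on PATH."""
    subprocess.run(build_ocr_command(src, dst, **options), check=True)
```

Calling `run_ocr("scan.pdf", "scan_ocr.pdf", languages=["eng", "deu"])` would then write a searchable copy next to the original, assuming both language packs are installed for Tesseract.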

The interesting bit

The author built this out of frustration with existing tools that broke copy/paste, mangled image resolution, produced comically oversized files, or crashed outright. The “lossless” insertion approach—adding OCR text without touching other content—sounds like table stakes, but apparently wasn’t. The plugin architecture is a nice touch: swap in Apple Vision on macOS, EasyOCR, or PaddleOCR if Tesseract isn’t your thing.

Key highlights

  • Preserves original image resolution; often compresses output smaller than input
  • Deskews, rotates, and cleans pages on request before OCR
  • Validates both input and output files
  • Scales to thousands of pages; claims to be “battle-tested on millions of PDFs”
  • MPL-2.0 license: usable in commercial and closed-source projects, but source-level modifications to OCRmyPDF itself must be published
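
The on-request page cleanup from the list above maps onto opt-in command-line flags. The sketch below assembles them; the helper names and file names are hypothetical, while `--deskew`, `--rotate-pages`, and `--clean` are existing `ocrmypdf` options. Note that `--clean` relies on the separate `unpaper` program being installed.

```python
from typing import List


def preprocessing_flags(
    deskew: bool = False, rotate: bool = False, clean: bool = False
) -> List[str]:
    """Translate preprocessing choices into ocrmypdf flags (hypothetical helper)."""
    flags = []
    if deskew:
        flags.append("--deskew")        # straighten slightly crooked scans
    if rotate:
        flags.append("--rotate-pages")  # fix pages scanned sideways or upside down
    if clean:
        flags.append("--clean")         # clean scanning artifacts before OCR; needs unpaper
    return flags


def build_cleanup_command(src: str, dst: str, **options: bool) -> List[str]:
    """Full argument list: program name, optional flags, then input and output."""
    return ["ocrmypdf", *preprocessing_flags(**options), src, dst]
```

None of these steps run unless asked for, which matches the "on request" wording: `build_cleanup_command("scan.pdf", "out.pdf")` yields a plain invocation with no preprocessing flags.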

Caveats

  • Requires external dependencies: Ghostscript and Tesseract must be installed separately
  • GPU “strongly recommended” for the EasyOCR and PaddleOCR plugins
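
Because the external tools are not bundled, a pipeline may want to check for them before starting a long batch. This is a small sketch using only the standard library. The executable names are assumptions: `tesseract` and `gs` are the usual names on Linux and macOS, while Windows builds of Ghostscript typically use a different name such as `gswin64c`.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust for your platform.
REQUIRED_TOOLS = ("tesseract", "gs")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    # shutil.which returns None when an executable is not found.
    return [tool for tool in tools if which(tool) is None]
```

A caller could run `missing = missing_tools()` at startup and stop with a clear message if the list is not empty. The `which` parameter exists only so the lookup can be swapped out in tests.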

Verdict

If you’re digitizing paper archives, building a document pipeline, or just tired of unsearchable scans, this is a solid, well-maintained workhorse. Skip it if you need a GUI or real-time OCR; this is strictly batch/command-line.
