wxyhgk/retain-pdf

Translate scanned PDFs without destroying the layout

A full-stack pipeline that OCRs, translates, and re-renders scientific PDFs while keeping formulas and formatting intact.

1.9k stars Python Other AI
Velocity (7d): +27 ★/day · Trend: steady

What it does

RetainPDF ingests PDFs—including image-based and scanned ones—runs OCR, translates the text, and spits out a new PDF that looks like the original. It handles inline formulas, tables, and code blocks without mangling them. The project ships as a desktop app (Windows, macOS, Linux), a Docker stack, or raw components you can repurpose.

The interesting bit

Most open-source PDF translators assume your document has selectable text and simple math. RetainPDF explicitly targets the ugly cases: scanned pages, complex inline formulas, and mixed layouts. The architecture is deliberately decoupled—Rust handles the API and task orchestration, Python runs the OCR and rendering pipeline—so you can swap in your own translator or OCR engine without gutting the whole system.
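To make the "swap in your own translator or OCR engine" idea concrete, here is a minimal Python sketch of what a pluggable pipeline stage can look like. Every name in it (`OcrEngine`, `Translator`, `Pipeline`, `FakeOcr`, `UpperTranslator`) is hypothetical and chosen for illustration; none of it is taken from RetainPDF's codebase, and the real interfaces may look quite different.

```python
from dataclasses import dataclass
from typing import Protocol


class OcrEngine(Protocol):
    """Hypothetical interface: turn one page image into text blocks."""

    def extract(self, page_image: bytes) -> list[str]: ...


class Translator(Protocol):
    """Hypothetical interface: translate a single text block."""

    def translate(self, text: str, target_lang: str) -> str: ...


@dataclass
class Pipeline:
    """Depends only on the two interfaces, so either side can be replaced."""

    ocr: OcrEngine
    translator: Translator

    def run(self, page_image: bytes, target_lang: str) -> list[str]:
        blocks = self.ocr.extract(page_image)
        return [self.translator.translate(b, target_lang) for b in blocks]


class FakeOcr:
    """Stand-in OCR: treats the bytes as UTF-8 text, one block per line."""

    def extract(self, page_image: bytes) -> list[str]:
        return page_image.decode("utf-8").split("\n")


class UpperTranslator:
    """Stand-in translator: tags the block and upper-cases it."""

    def translate(self, text: str, target_lang: str) -> str:
        return f"[{target_lang}] {text.upper()}"


pipeline = Pipeline(ocr=FakeOcr(), translator=UpperTranslator())
result = pipeline.run(b"hello\nworld", "zh")
```

Because `Pipeline` only knows about the two protocols, replacing `FakeOcr` with a real engine would not require touching the translation side, which is the property the decoupled design is after.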

Key highlights

  • Handles image-based and scanned PDFs, not just text-extractable ones
  • Attempts to preserve inline formulas, tables, and code blocks (with configurable rules)
  • Claims smaller output files and finer font control than some closed-source alternatives
  • Full-stack delivery: static frontend, Rust API, Python pipeline, Electron desktop, Docker compose
  • MIT licensed; structured for extension and module replacement
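One common way to keep inline formulas intact through translation is to mask them with placeholders first and restore them afterwards. The sketch below shows that general technique only; it assumes `$...$` delimiters and an `@@F<n>@@` placeholder format, both picked for illustration, and it is not a description of how RetainPDF implements its rules.

```python
import re

# Assumption: inline formulas are delimited by single dollar signs.
INLINE_MATH = re.compile(r"\$[^$]+\$")


def mask_formulas(text: str) -> tuple[str, list[str]]:
    """Replace each inline formula with a numbered placeholder."""
    formulas: list[str] = []

    def stash(match: re.Match) -> str:
        formulas.append(match.group(0))
        return f"@@F{len(formulas) - 1}@@"

    return INLINE_MATH.sub(stash, text), formulas


def unmask_formulas(text: str, formulas: list[str]) -> str:
    """Put the original formulas back in place of their placeholders."""
    for index, formula in enumerate(formulas):
        text = text.replace(f"@@F{index}@@", formula)
    return text


source = "Energy is $E = mc^2$ and area is $a^2$."
masked, stored = mask_formulas(source)
# `masked` would be sent to the translator; the formulas never leave `stored`.
restored = unmask_formulas(masked, stored)
```

The translator only ever sees the placeholder tokens, so it cannot rewrite the math; the trade-off is that it must pass those tokens through unchanged.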

Caveats

  • macOS desktop builds are unsigned; you must strip quarantine flags manually
  • The React frontend is still in migration; production uses a static frontend
  • The project's own documentation warns that docs, APIs, and configs may drift out of sync
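For the unsigned macOS build, stripping the quarantine flag usually means removing the `com.apple.quarantine` extended attribute with the system `xattr` tool. The sketch below wraps that command in Python; the helper names and the app path `/Applications/RetainPDF.app` are assumptions, so substitute wherever the app actually lives on your machine.

```python
import subprocess
import sys


def quarantine_strip_command(app_path: str) -> list[str]:
    """Build the xattr call: -d deletes the attribute, -r recurses into the bundle."""
    return ["xattr", "-dr", "com.apple.quarantine", app_path]


def strip_quarantine(app_path: str) -> None:
    """Run the command; refuses to run anywhere but macOS."""
    if sys.platform != "darwin":
        raise RuntimeError("quarantine attributes only exist on macOS")
    subprocess.run(quarantine_strip_command(app_path), check=True)


# Hypothetical install location, shown for illustration only.
command = quarantine_strip_command("/Applications/RetainPDF.app")
```

Calling `strip_quarantine("/Applications/RetainPDF.app")` on a Mac would then execute the command. Only do this for builds you trust, since it bypasses the Gatekeeper prompt.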

Verdict

Worth a look if you regularly translate scientific papers or technical books and need the output to stay readable. Skip it if you only translate clean, text-based PDFs—simpler tools will do.
