Turn scanned book PDFs into clean Markdown or EPUB locally
A Python tool that uses DeepSeek OCR to convert scanned academic books into structured Markdown or EPUB without calling LLM APIs.

What it does
pdf-craft takes scanned PDFs—think textbooks, academic papers, old manuals—and converts them into Markdown or EPUB. It runs entirely on your machine using DeepSeek OCR models, pulling out body text while filtering headers, footers, and other noise. Footnotes, tables, formulas, and images all make it through to the output.
The interesting bit
The project deliberately ditched LLM-based text correction in v1.0.0 to go fully local. No API keys, no network calls, no rate limits—just GPU-accelerated OCR from PDF to finished document. The trade-off is real: you lose LLM polish, but gain speed and offline reliability. For TOC extraction, you can still optionally plug in an LLM if your book’s chapter hierarchy is particularly tortured.
Key highlights
- Built on DeepSeek OCR with five model sizes, from tiny to gundam (the default, and the largest and highest quality)
- Outputs Markdown or EPUB with automatic TOC generation for EPUB
- Handles tables (HTML or image clipping), formulas (MathML, SVG, or clipping), and footnotes
- Supports offline mode with pre-downloaded models via local_only=True
- Configurable error handling: stop on failures, ignore them, or inject custom callbacks
- Optional online demo at Inkora if you want to test before installing Poppler and CUDA
Caveats
- Requires Poppler for PDF parsing and CUDA for OCR, so the “quick start” pip install alone is not sufficient
- DeepSeek OCR models download from Hugging Face on first run unless pre-cached
- LLM text correction was removed in v1.0.0; still available in v0.2.8 if you need it
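Since the prerequisites above are easy to miss, a small preflight check can save a failed first run. The sketch below is not part of pdf-craft. The function name check_prerequisites is hypothetical, and it assumes the OCR stack runs on PyTorch. It looks for pdftoppm (one of the command-line tools that ships with Poppler) and nvidia-smi on the PATH, then asks PyTorch whether CUDA is usable if PyTorch is installed.

```python
import shutil


def check_prerequisites() -> dict:
    """Report which external prerequisites appear to be present.

    Hypothetical helper, not part of pdf-craft. Each check is a heuristic.
    """
    status = {
        # pdftoppm is one of the command-line tools shipped with Poppler.
        "poppler": shutil.which("pdftoppm") is not None,
        # nvidia-smi ships with the NVIDIA driver; finding it is only a hint
        # that a CUDA-capable GPU is available.
        "nvidia_driver": shutil.which("nvidia-smi") is not None,
        "torch_cuda": False,
    }
    try:
        # Assumption: the OCR models run on PyTorch.
        import torch
    except ImportError:
        return status
    status["torch_cuda"] = bool(torch.cuda.is_available())
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

If poppler or torch_cuda comes back missing, fix that before pointing the tool at a book. The nvidia_driver entry is only a secondary hint and does not prove that CUDA works.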
Verdict
Worth a look if you regularly digitize scanned books or academic PDFs and want a local, scriptable pipeline. Skip it if your PDFs are already text-based or if setting up Poppler + CUDA sounds like too much ceremony.