oomol-lab/pdf-craft

Turn scanned book PDFs into clean Markdown or EPUB locally

A Python tool that uses DeepSeek OCR to convert scanned academic books into structured Markdown or EPUB without calling LLM APIs.

5.7k stars · Python · Computer Vision · Data Tooling
Velocity (7d): +12 ★ / day · Trend: steady

What it does

pdf-craft takes scanned PDFs—think textbooks, academic papers, old manuals—and converts them into Markdown or EPUB. It runs entirely on your machine using DeepSeek OCR models, pulling out body text while filtering headers, footers, and other noise. Footnotes, tables, formulas, and images all make it through to the output.

The interesting bit

The project deliberately ditched LLM-based text correction in v1.0.0 to go fully local. No API keys, no network calls, no rate limits—just GPU-accelerated OCR from PDF to finished document. The trade-off is real: you lose LLM polish, but gain speed and offline reliability. For TOC extraction, you can still optionally plug in an LLM if your book’s chapter hierarchy is particularly tortured.

Key highlights

  • Built on DeepSeek OCR with five model sizes from tiny to gundam (default: largest/highest quality)
  • Outputs Markdown or EPUB with automatic TOC generation for EPUB
  • Handles tables (HTML or image clipping), formulas (MathML, SVG, or clipping), and footnotes
  • Supports offline mode with pre-downloaded models via local_only=True
  • Configurable error handling: stop on failures, ignore them, or inject custom callbacks
  • Optional online demo at Inkora if you want to test before installing Poppler and CUDA
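The three error-handling modes in the list above (stop, ignore, custom callback) can be pictured with a small standalone sketch. This is not pdf-craft's API: `convert_pages`, `flaky_ocr`, and the `on_error` parameter are hypothetical names invented for illustration, and the real option names should be checked in the project's README.

```python
from typing import Callable, Iterable, Literal, Union

# Hypothetical names throughout; this is not pdf-craft's API.
ErrorPolicy = Union[Literal["stop", "ignore"], Callable[[int, Exception], bool]]


def convert_pages(
    pages: Iterable[str],
    recognize: Callable[[str], str],
    on_error: ErrorPolicy = "stop",
) -> list[str]:
    """Run `recognize` on each page, applying the chosen error policy."""
    results: list[str] = []
    for index, page in enumerate(pages):
        try:
            results.append(recognize(page))
        except Exception as error:
            if on_error == "stop":
                raise  # abort the whole run on the first failure
            if on_error == "ignore":
                continue  # drop the failed page and keep going
            # Custom callback: returning True skips the page, False aborts.
            if not on_error(index, error):
                raise
    return results


def flaky_ocr(page: str) -> str:
    """Stand-in recognizer that fails on one page to exercise the policies."""
    if page == "smudged":
        raise ValueError("unreadable page")
    return page.upper()
```

With `on_error="ignore"` the failing page is simply left out of the result; a callback is the place to log the page index or decide case by case whether the run should continue.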

Caveats

  • Requires Poppler for PDF parsing and CUDA for OCR; the “quick start” pip install is not actually sufficient
  • DeepSeek OCR models download from Hugging Face on first run unless pre-cached
  • LLM text correction was removed in v1.0.0; still available in v0.2.8 if you need it
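Since a bare pip install does not cover Poppler or CUDA, a quick preflight check can save a failed first run. The following is a minimal sketch using only the standard library. It assumes Poppler's `pdftoppm` tool and NVIDIA's `nvidia-smi` are on `PATH` when installed, and that the OCR stack runs on PyTorch; `check_prerequisites` is a hypothetical helper, not part of pdf-craft.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report which external prerequisites appear to be available."""
    status = {
        # Poppler ships command-line tools such as pdftoppm.
        "poppler": shutil.which("pdftoppm") is not None,
        # nvidia-smi on PATH suggests an NVIDIA driver is installed.
        "nvidia_driver": shutil.which("nvidia-smi") is not None,
        "cuda_torch": False,
    }
    # Assumption: the OCR models run on PyTorch; only query it if installed.
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            status["cuda_torch"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install counts as "not available" here.
            status["cuda_torch"] = False
    return status
```

Calling `check_prerequisites()` returns a dict of booleans; any `False` entry points at the piece to install before trying a conversion.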

Verdict

Worth a look if you regularly digitize scanned books or academic PDFs and want a local, scriptable pipeline. Skip it if your PDFs are already text-based or if setting up Poppler + CUDA sounds like too much ceremony.
