Is pdf-craft open source?

Yes — oomol-lab/pdf-craft is open source, released under the MIT license.

What language is pdf-craft written in?

oomol-lab/pdf-craft is primarily written in Python.

How popular is pdf-craft?

oomol-lab/pdf-craft has 5.8k stars on GitHub.

Where can I find pdf-craft?

oomol-lab/pdf-craft is on GitHub at https://github.com/oomol-lab/pdf-craft.

← all repositories

oomol-lab/pdf-craft

Turn scanned book PDFs into clean Markdown or EPUB locally

A Python tool that uses DeepSeek OCR to convert scanned academic books into structured Markdown or EPUB without calling LLM APIs.

★5.8k stars Python Computer Vision Data Tooling

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

pdf-craft takes scanned PDFs—think textbooks, academic papers, old manuals—and converts them into Markdown or EPUB. It runs entirely on your machine using DeepSeek OCR models, pulling out body text while filtering headers, footers, and other noise. Footnotes, tables, formulas, and images all make it through to the output.

The interesting bit

The project deliberately ditched LLM-based text correction in v1.0.0 to go fully local. No API keys, no network calls, no rate limits—just GPU-accelerated OCR from PDF to finished document. The trade-off is real: you lose LLM polish, but gain speed and offline reliability. For TOC extraction, you can still optionally plug in an LLM if your book’s chapter hierarchy is particularly tortured.

Key highlights

Built on DeepSeek OCR with five model sizes from tiny to gundam (default: largest/highest quality)
Outputs Markdown or EPUB with automatic TOC generation for EPUB
Handles tables (HTML or image clipping), formulas (MathML, SVG, or clipping), and footnotes
Supports offline mode with pre-downloaded models via local_only=True
Configurable error handling: stop on failures, ignore them, or inject custom callbacks
Optional online demo at Inkora if you want to test before installing Poppler and CUDA

Caveats

Requires Poppler for PDF parsing and CUDA for OCR; the “quick start” pip install is not actually sufficient
DeepSeek OCR models download from Hugging Face on first run unless pre-cached
LLM text correction was removed in v1.0.0; still available in v0.2.8 if you need it

Verdict

Worth a look if you regularly digitize scanned books or academic PDFs and want a local, scriptable pipeline. Skip it if your PDFs are already text-based or if setting up Poppler + CUDA sounds like too much ceremony.

Frequently asked

What is oomol-lab/pdf-craft?: A Python tool that uses DeepSeek OCR to convert scanned academic books into structured Markdown or EPUB without calling LLM APIs.
Is pdf-craft open source?: Yes — oomol-lab/pdf-craft is open source, released under the MIT license.
What language is pdf-craft written in?: oomol-lab/pdf-craft is primarily written in Python.
How popular is pdf-craft?: oomol-lab/pdf-craft has 5.8k stars on GitHub.
Where can I find pdf-craft?: oomol-lab/pdf-craft is on GitHub at https://github.com/oomol-lab/pdf-craft.