firecrawl/pdf-inspector

A Rust PDF triage nurse that decides if you really need OCR

Built by Firecrawl to dodge expensive OCR for the ~54% of PDFs that are already text-based.

1.4k stars · Rust · Data Tooling

What it does

pdf-inspector is a Rust library that classifies PDFs (text-based, scanned, image-based, or mixed) in roughly 10–50 ms, then extracts text and converts it to Markdown without OCR, ML models, or external services. It comes with Python and Node.js bindings, plus CLI tools for batch conversion and detection-only mode.

The interesting bit

The classification is surgical: it samples page content streams for text-showing operators (Tj/TJ) and for Do, the operator that paints XObjects such as images, then returns a confidence score and a list of the specific pages that need OCR. That lets you route only the bad pages to an OCR service instead of sending the whole document. The README is unusually honest about where it trails OCR-based competitors: heading detection struggles when PDFs use bold body text as headings, and table detection lacks the visual understanding that image-based engines get.
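The operator-counting idea can be sketched in a few lines of Rust. This is a minimal illustration, not pdf-inspector's actual code or API: the names `PageKind`, `count_ops`, `classify_page`, and `pages_needing_ocr` are hypothetical, the input is assumed to be an already-decompressed content stream with whitespace-separated operators, and confidence scoring is left out.

```rust
/// Hypothetical page categories for this sketch (not the library's types).
#[derive(Debug, PartialEq)]
enum PageKind {
    Text,
    Scanned,
    Mixed,
    Empty,
}

/// Count text-showing operators (Tj, TJ) and XObject paints (Do).
/// Assumption: operators are separated by whitespace. A real parser must
/// also handle forms like `(Hello)Tj` and operator names inside strings.
fn count_ops(stream: &str) -> (usize, usize) {
    let mut text_ops = 0;
    let mut image_ops = 0;
    for token in stream.split_whitespace() {
        match token {
            "Tj" | "TJ" => text_ops += 1,
            "Do" => image_ops += 1,
            _ => {}
        }
    }
    (text_ops, image_ops)
}

/// Classify one page from its operator counts.
fn classify_page(stream: &str) -> PageKind {
    match count_ops(stream) {
        (0, 0) => PageKind::Empty,
        (_, 0) => PageKind::Text,
        (0, _) => PageKind::Scanned,
        _ => PageKind::Mixed,
    }
}

/// Return 1-based numbers of pages that paint XObjects but show no text.
fn pages_needing_ocr(pages: &[&str]) -> Vec<usize> {
    let mut flagged = Vec::new();
    for (index, stream) in pages.iter().enumerate() {
        if classify_page(stream) == PageKind::Scanned {
            flagged.push(index + 1);
        }
    }
    flagged
}
```

Note that Do also paints form XObjects, so a page with only Do calls is a strong hint of a scan rather than proof; a production classifier would need to inspect the referenced XObject's subtype.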

Key highlights

  • Single document load shared between detection and extraction — no redundant I/O
  • Position-aware extraction with multi-column layout detection and RTL support
  • Dual-mode table detection: rectangle-based from PDF drawing ops plus heuristic alignment-based
  • CID font support with ToUnicode CMap decoding for gnarly encodings (UTF-16BE, Identity-H)
  • Fastest in its benchmark class: 4 seconds for 200 PDFs vs. 11–18s for direct-text competitors (OCR/ML engines take 2–180 minutes)
  • Per-page OCR routing with configurable scan strategies: early-exit, full scan, or sampled pages
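To make the three scan strategies concrete, here is a rough Rust sketch of how they might differ. Everything in it is an assumption for illustration: the `ScanStrategy` enum, the `pages_flagged` function, and the reading of "early-exit" as "stop at the first page without text" and "sampled" as "check every n-th page" are not taken from pdf-inspector's API.

```rust
/// Hypothetical scan strategies (names and semantics are assumptions).
#[derive(Debug, Clone, Copy)]
enum ScanStrategy {
    /// Stop as soon as one page without text is found.
    EarlyExit,
    /// Inspect every page.
    Full,
    /// Inspect every n-th page, starting with the first.
    Sampled(usize),
}

/// Given a per-page "has extractable text" flag, return the 1-based page
/// numbers that the chosen strategy would flag for OCR.
fn pages_flagged(has_text: &[bool], strategy: ScanStrategy) -> Vec<usize> {
    // A step of 0 would panic in step_by, so clamp it to at least 1.
    let step = match strategy {
        ScanStrategy::Sampled(n) => n.max(1),
        _ => 1,
    };
    let mut flagged = Vec::new();
    for (index, &ok) in has_text.iter().enumerate().step_by(step) {
        if !ok {
            flagged.push(index + 1);
            if matches!(strategy, ScanStrategy::EarlyExit) {
                break;
            }
        }
    }
    flagged
}
```

The trade-off under these assumptions: a full scan gives an exact list of pages to route, early exit only answers "does anything need OCR?" cheaply, and sampling bounds the work on very long documents at the cost of possibly missing scanned pages between samples.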

Caveats

  • Heading detection lags behind opendataloader (0.57 vs. 0.74 MHS score) — font-size heuristics miss edge cases
  • Table detection (0.59 TEDS) trails OCR-based engines that can actually “see” visual table borders
  • Python install currently requires building from source with maturin develop --release — no PyPI package mentioned

Verdict

Grab this if you’re running high-volume PDF pipelines and want to skip OCR for the roughly half of documents (~54% by Firecrawl’s figure) that are already text-based. Skip it if you need perfect heading extraction or heavily visual document layouts where image-based understanding matters more than speed.
