Is pdf-inspector open source?

Yes — firecrawl/pdf-inspector is open source, released under the MIT license.

What language is pdf-inspector written in?

firecrawl/pdf-inspector is primarily written in Rust.

How popular is pdf-inspector?

firecrawl/pdf-inspector has 1.6k stars on GitHub.

Where can I find pdf-inspector?

firecrawl/pdf-inspector is on GitHub at https://github.com/firecrawl/pdf-inspector.

firecrawl/pdf-inspector

PDF triage in Rust: fast text extraction, OCR as fallback

Built to stop pipelines from sending text-based PDFs through expensive OCR services.

★ 1.6k stars · Rust · Data Tooling

What it does

pdf-inspector is a Rust library that inspects a PDF, classifies it as text-based, scanned, image-based, or mixed, and extracts clean Markdown when a text layer is available. It handles tables, multi-column layouts, headings, lists, and even drop caps by reading the PDF’s content streams directly rather than treating every page as an image. The document is parsed once and the result is shared between detection and extraction, which avoids redundant I/O.
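
The parse-once design can be sketched as follows. This is a minimal illustration, not pdf-inspector's actual API: `ParsedDoc`, `Kind`, `classify`, and `to_markdown` are hypothetical names, each page is modeled as an optional string of already-extracted text, and scanned and image-based files are collapsed into a single `NoText` bucket because the description above does not say how the library tells them apart.

```rust
/// Hypothetical stand-in for a parsed PDF (not pdf-inspector's real type).
/// `None` models a page without a text layer.
struct ParsedDoc {
    pages: Vec<Option<String>>,
}

/// Simplified classification; the real library distinguishes four kinds.
#[derive(Debug, PartialEq)]
enum Kind {
    TextBased,
    Mixed,
    NoText,
}

/// Detection borrows the parsed document instead of re-reading the file.
fn classify(doc: &ParsedDoc) -> Kind {
    let with_text = doc.pages.iter().filter(|p| p.is_some()).count();
    if with_text == 0 {
        Kind::NoText
    } else if with_text == doc.pages.len() {
        Kind::TextBased
    } else {
        Kind::Mixed
    }
}

/// Extraction borrows the same parsed document, so nothing is parsed twice.
fn to_markdown(doc: &ParsedDoc) -> String {
    doc.pages
        .iter()
        .flatten() // skip pages without a text layer
        .map(|s| s.as_str())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Both functions take `&ParsedDoc`, so a caller builds the value once and passes the same borrow to each step.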

The interesting bit

The classifier is deliberately frugal: it samples content streams for text operators and can exit early, so it classifies a 300-page PDF in milliseconds. Pages that lack text operators are flagged individually for OCR rather than sending the entire file to OCR. The README claims this approach skips expensive OCR for roughly 54% of PDFs. On the 200-document opendataloader-bench corpus, benchmark data shows it is the fastest direct-text engine tested, though it trails opendataloader in overall accuracy.
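
The idea of scanning for text operators with an early exit can be sketched as below. This is a simplified illustration under stated assumptions, not the library's code: it assumes each page's content stream has already been decompressed, it looks only for the PDF text-showing operators `Tj`, `TJ`, `'`, and `"`, and it does not special-case inline image data. The function names are hypothetical; `pages_needing_ocr` only mirrors the field name mentioned on this page.

```rust
/// Text-showing operators defined by the PDF specification.
const TEXT_OPS: [&[u8]; 4] = [b"Tj", b"TJ", b"'", b"\""];

/// True for bytes that end a token: whitespace or a PDF delimiter.
fn is_boundary(b: u8) -> bool {
    b == 0
        || b.is_ascii_whitespace()
        || matches!(b, b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%')
}

/// Returns at the first text-showing operator found outside a string
/// literal, name, or comment. Assumes `stream` is already decoded.
fn has_text_operator(stream: &[u8]) -> bool {
    let mut i = 0;
    while i < stream.len() {
        match stream[i] {
            b'(' => {
                // Skip a literal string, honoring nesting and backslash escapes.
                let mut depth = 0usize;
                while i < stream.len() {
                    match stream[i] {
                        b'\\' => i += 1, // skip the escaped byte
                        b'(' => depth += 1,
                        b')' => {
                            depth -= 1;
                            if depth == 0 {
                                break;
                            }
                        }
                        _ => {}
                    }
                    i += 1;
                }
                i += 1; // step past the closing parenthesis
            }
            b'/' => {
                // Skip a name object such as /F1.
                i += 1;
                while i < stream.len() && !is_boundary(stream[i]) {
                    i += 1;
                }
            }
            b'%' => {
                // Skip a comment up to the end of the line.
                while i < stream.len() && stream[i] != b'\n' && stream[i] != b'\r' {
                    i += 1;
                }
            }
            b if is_boundary(b) => i += 1,
            _ => {
                let start = i;
                while i < stream.len() && !is_boundary(stream[i]) {
                    i += 1;
                }
                let token = &stream[start..i];
                if TEXT_OPS.iter().any(|op| *op == token) {
                    return true; // early exit: one hit is enough
                }
            }
        }
    }
    false
}

/// Returns the 1-based numbers of pages whose streams show no text.
fn pages_needing_ocr(page_streams: &[&[u8]]) -> Vec<usize> {
    let mut flagged = Vec::new();
    for (idx, stream) in page_streams.iter().enumerate() {
        if !has_text_operator(stream) {
            flagged.push(idx + 1);
        }
    }
    flagged
}
```

Because the scan stops at the first hit, a text page costs only as many bytes as it takes to reach its first text-showing operator, while an image-only page is read to the end before being flagged.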

Key highlights

Classifies PDFs in roughly 10–50 ms with per-page pages_needing_ocr routing instead of all-or-nothing decisions.
Converts text to Markdown using position-aware layout detection, including tables detected via both rectangle-based drawing operations and heuristic text alignment.
Pure Rust with a single dependency, lopdf; uses no ML models or external services, and includes Python and Node.js bindings.
Processed the 200-PDF benchmark corpus in 4 seconds, compared with 8 seconds for markitdown, 11 for opendataloader, and 18 for pymupdf4llm.
Detects multi-column reading order and RTL text, and flags broken font encodings it cannot handle so callers know when to fall back to OCR.
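
To put the corpus timings in perspective, the short sketch below converts them into throughput and into a slowdown factor relative to the fastest run. The only inputs are the figures quoted above (200 PDFs; 4, 8, 11, and 18 seconds); since those are rounded to whole seconds, the results are approximations. The function names are illustrative.

```rust
/// Size of the benchmark corpus quoted above.
const CORPUS_SIZE: f64 = 200.0;

/// (engine, wall-clock seconds) as quoted above, rounded to whole seconds.
const RUNS: [(&str, f64); 4] = [
    ("pdf-inspector", 4.0),
    ("markitdown", 8.0),
    ("opendataloader", 11.0),
    ("pymupdf4llm", 18.0),
];

/// PDFs processed per second for a run that took `seconds`.
fn docs_per_second(seconds: f64) -> f64 {
    CORPUS_SIZE / seconds
}

/// How many times longer a run took than the fastest run in `RUNS`.
fn slowdown_vs_fastest(seconds: f64) -> f64 {
    let fastest = RUNS.iter().map(|&(_, s)| s).fold(f64::INFINITY, f64::min);
    seconds / fastest
}

/// One summary line per engine.
fn report() -> Vec<String> {
    RUNS.iter()
        .map(|&(name, s)| {
            format!(
                "{}: {:.1} PDFs/s, {:.2}x the fastest time",
                name,
                docs_per_second(s),
                slowdown_vs_fastest(s)
            )
        })
        .collect()
}
```

On these figures pdf-inspector works out to about 50 PDFs per second, against roughly 25 for markitdown, 18 for opendataloader, and 11 for pymupdf4llm.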

Caveats

Heading detection lags behind opendataloader because many PDFs use bold body text or only slightly larger font sizes for headings.
Table detection scores 0.59 TEDS, trailing OCR-based engines that can visually perceive table structure.
Overall benchmark score is 0.78, below opendataloader’s 0.84, though still above pymupdf4llm and markitdown.

Verdict

Ideal for high-volume pipelines that need to route PDFs away from unnecessary OCR; less ideal if your primary need is perfect heading hierarchy or complex visual table reconstruction without a fallback engine.
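
A pipeline might act on a per-page OCR list roughly as sketched below. The `Route` enum and both functions are hypothetical and are not part of pdf-inspector; the sketch assumes the caller already has a page count and a list of 1-based page numbers that need OCR, and it drops out-of-range and duplicate entries before deciding.

```rust
/// Hypothetical routing decision for one PDF.
#[derive(Debug, PartialEq)]
enum Route {
    /// Every page has usable text; no OCR call is needed.
    Direct,
    /// Only the listed pages go to OCR; the rest are extracted directly.
    PartialOcr { pages: Vec<usize> },
    /// No page has usable text; the whole file goes to OCR.
    FullOcr,
}

/// Decides how to route a file from its per-page OCR list.
fn route(page_count: usize, pages_needing_ocr: &[usize]) -> Route {
    // Keep only valid 1-based page numbers, sorted and without duplicates.
    let mut pages: Vec<usize> = pages_needing_ocr
        .iter()
        .copied()
        .filter(|p| (1..=page_count).contains(p))
        .collect();
    pages.sort_unstable();
    pages.dedup();

    if pages.is_empty() {
        Route::Direct
    } else if pages.len() == page_count {
        Route::FullOcr
    } else {
        Route::PartialOcr { pages }
    }
}

/// Number of pages that never reach the OCR service under this route.
fn ocr_pages_avoided(page_count: usize, route: &Route) -> usize {
    match route {
        Route::Direct => page_count,
        Route::PartialOcr { pages } => page_count - pages.len(),
        Route::FullOcr => 0,
    }
}
```

For a 300-page file with two flagged pages, this sends 2 pages to OCR instead of 300, which is the saving that per-page routing is meant to capture.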
