getomni-ai/zerox

OCR by asking GPT-4o to read screenshots of your PDF

Zerox turns documents into images and feeds them to vision models, because tables and weird layouts break traditional text extractors.

12.2k stars · TypeScript · Data Tooling
Velocity (7d): +18 ★/day · Trend: steady

What it does

Zerox converts PDFs, DOCX files, and images into a sequence of page images, then asks a vision model (GPT-4o, Claude 3, Gemini, etc.) to transcribe each one into Markdown. It aggregates the results and hands back structured text with optional per-page metadata like token counts and completion time. There’s also a JSON Schema extraction mode if you need structured data instead of raw Markdown.

The interesting bit

The maintainFormat option chains pages sequentially: page 1’s Markdown gets fed into the prompt for page 2, which helps preserve tables that span page breaks. The trade-off is speed: concurrency drops to 1. It’s a blunt but effective hack for a genuinely hard layout problem.
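The chaining idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Zerox's actual implementation: the `Transcriber` type and both function names are hypothetical, and the vision-model call is left as a function you pass in.

```typescript
// Hypothetical signature: takes one page image (a path or base64 string) plus,
// optionally, the Markdown produced for the previous page.
type Transcriber = (pageImage: string, priorMarkdown?: string) => Promise<string>;

// Sequential mode: each page waits for the previous page's Markdown, so the
// model can continue a table that started earlier. Effective concurrency is 1.
async function transcribeSequential(
  pages: string[],
  transcribe: Transcriber
): Promise<string[]> {
  const results: string[] = [];
  let prior: string | undefined;
  for (const page of pages) {
    const markdown = await transcribe(page, prior);
    results.push(markdown);
    prior = markdown; // becomes context for the next page
  }
  return results;
}

// Independent mode: pages carry no context from one another, so all requests
// can be issued at once.
async function transcribeIndependent(
  pages: string[],
  transcribe: Transcriber
): Promise<string[]> {
  return Promise.all(pages.map((page) => transcribe(page)));
}
```

The sequential version cannot start page N until page N-1 has finished, which is why enabling this kind of chaining costs wall-clock time roughly proportional to page count.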

Key highlights

  • Supports OpenAI, Azure, AWS Bedrock, and Google Gemini with provider-specific credential passing
  • Node and Python SDKs, though feature parity is uneven (Node has schema extraction and orientation correction; Python has custom system prompts and Vertex AI)
  • Optional per-page structured extraction with a separate model/provider from the OCR step
  • Configurable concurrency, DPI, image compression, and temp directory management
  • Requires system dependencies: graphicsmagick/ghostscript for Node, poppler for Python
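To make the configurable-concurrency point concrete, here is a generic sketch of a bounded worker pool for per-page requests. It is an assumption about how such a limit is commonly implemented, not code from Zerox; `mapWithConcurrency` and its parameters are hypothetical names.

```typescript
// Runs `fn` over `items` with at most `limit` calls in flight at once,
// returning results in the original order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker repeatedly claims the next item until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With a limit of 1 this degenerates to strictly sequential processing, which matches the behaviour described for maintainFormat.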

Caveats

  • The Node and Python versions diverge significantly; check the feature matrix before picking one
  • No pricing or latency benchmarks are provided, and token counts from the example suggest multi-page documents could get expensive quickly
  • The README mentions Tesseract workers in the API but doesn’t explain when they are used instead of, or alongside, the vision model

Verdict

Worth a look if you’re already paying for vision-model API access and your documents have complex layouts that defeat conventional OCR. Skip it if you need deterministic, offline, or low-cost extraction: this is fundamentally a cloud-API wrapper with image preprocessing glue.
