NanoNets/docstrange

A document parser that actually runs offline

DocStrange converts PDFs, scans, and Office files into structured Markdown or JSON—locally, if you want.

1.5k stars · Python · Data Tooling · Velocity (7d): +4.8 ★/day · Trend: steady

What it does

DocStrange is a Python library that ingests PDFs, Word docs, PowerPoints, Excel sheets, images, and even URLs, then spits them out as Markdown, JSON, CSV, or HTML. It handles OCR on scanned documents and photos, extracts tables into clean formatting, and can target specific fields or conform to a custom JSON schema. There’s also a built-in local web UI for drag-and-drop conversion.
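To make "conform to a custom JSON schema" concrete, here is a minimal sketch of what a schema-shaped extraction target might look like, plus a small check that a returned record matches it. The invoice fields, the `INVOICE_SCHEMA` mapping, and the `conforms` helper are all hypothetical, and the mapping is a simplified stand-in for a real JSON Schema. Nothing here calls DocStrange itself, and the library's own schema handling may differ.

```python
# Hypothetical target shape: field name -> expected Python type.
# A simplified stand-in for a real JSON Schema document.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total": float,
    "line_items": list,
}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if record has exactly the schema's keys with matching types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], expected) for key, expected in schema.items())


# Illustrative record, standing in for what an extractor might return.
sample = {
    "invoice_number": "INV-001",
    "total": 129.5,
    "line_items": [{"sku": "A1", "qty": 2}],
}
```

A check like this is useful downstream of any extractor: it lets a pipeline reject a malformed record before it reaches a database or a prompt.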

The interesting bit

The dual-mode architecture is the real hook: cloud processing is the default (free up to 10,000 documents per month), but set gpu=True and the entire pipeline (OCR, layout detection, and a 7B-parameter model) runs 100% locally on your own hardware. No data leaves the machine. That’s increasingly rare in the “AI document processing” space, where most tools are API-shaped black boxes.
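Because cloud mode is the default, a privacy-sensitive pipeline may want to make the local opt-in explicit in one place. The sketch below shows one way to do that. Only the `gpu=True` flag comes from this write-up; the `DocumentExtractor` class and its `extract()` and `extract_markdown()` methods are assumptions recalled from the project README and should be checked against the installed version. The helper names `extractor_kwargs` and `convert_to_markdown` are hypothetical.

```python
def extractor_kwargs(require_local: bool) -> dict:
    """Build constructor arguments; gpu=True is the local-mode flag described above."""
    return {"gpu": True} if require_local else {}


def convert_to_markdown(path: str, require_local: bool = True) -> str:
    """Convert one document to Markdown, defaulting to local processing."""
    # Assumption: DocumentExtractor, extract(), and extract_markdown() are
    # recalled from the project README, not confirmed by this write-up.
    # Imported lazily so the helper above works without docstrange installed.
    from docstrange import DocumentExtractor

    extractor = DocumentExtractor(**extractor_kwargs(require_local))
    return extractor.extract(path).extract_markdown()
```

Defaulting `require_local` to `True` flips the library's default, so a caller has to ask for cloud processing rather than for privacy.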

Key highlights

  • Supports PDF, DOCX, PPTX, XLSX, images, and URLs as inputs
  • Outputs LLM-optimized Markdown, structured JSON with schema support, HTML, and CSV
  • Local mode requires CUDA for GPU acceleration; CPU fallback is mentioned but not detailed
  • Built-in web interface, installed with pip install "docstrange[web]", runs on localhost:8000
  • MCP server integration for Claude Desktop document navigation
  • Models download automatically on first local run

Caveats

  • The README claims “works on GPU or CPU when running locally” but the local processing section only documents gpu=True and notes CUDA is required; CPU behavior is unclear
  • Cloud mode is default, so privacy requires explicit opt-in to local mode
  • “7B model” is referenced but not named or characterized beyond parameter count

Verdict

Worth a look if you’re building RAG pipelines or data extraction workflows and need an escape hatch from cloud-only APIs. Skip it if you need transparent model provenance or guaranteed CPU-only local operation.
