A document parser that actually runs offline
DocStrange converts PDFs, scans, and Office files into structured Markdown or JSON—locally, if you want.

What it does
DocStrange is a Python library that ingests PDFs, Word docs, PowerPoints, Excel sheets, images, and even URLs, then spits them out as Markdown, JSON, CSV, or HTML. It handles OCR on scanned documents and photos, extracts tables into clean formatting, and can target specific fields or conform to a custom JSON schema. There’s also a built-in local web UI for drag-and-drop conversion.
The interesting bit
The dual-mode architecture is the real hook: cloud processing is the default (free up to 10,000 documents per month), but flip gpu=True and the entire pipeline—OCR, layout detection, and a 7B-parameter model—runs 100% locally on your own hardware. No data leaves the machine. That’s increasingly rare in the “AI document processing” space where most tools are API-shaped black boxes.
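Here is a minimal sketch of what that switch might look like in code. The gpu=True flag is the one the README documents. The DocumentExtractor class and the extract() and extract_markdown() calls are assumptions about the library's API and should be checked against the installed version. The extractor_options helper is hypothetical, added only for illustration.

```python
def extractor_options(local: bool) -> dict:
    """Build constructor keyword arguments (hypothetical helper).

    gpu=True is the flag the README documents for local processing;
    an empty dict leaves the library on its default (cloud) mode.
    """
    return {"gpu": True} if local else {}


def convert_to_markdown(path: str, local: bool = True) -> str:
    """Convert one document to Markdown, locally unless told otherwise."""
    # Imported lazily so this module loads even without docstrange installed.
    # Assumption: DocumentExtractor, extract() and extract_markdown() match
    # the API of the installed docstrange version.
    from docstrange import DocumentExtractor

    extractor = DocumentExtractor(**extractor_options(local))
    result = extractor.extract(path)
    return result.extract_markdown()
```

Calling convert_to_markdown("invoice.pdf") with a placeholder path like this would request the local pipeline by default, which inverts the library's own cloud-first default. That seems the safer choice for a wrapper meant to handle sensitive files.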
Key highlights
- Supports PDF, DOCX, PPTX, XLSX, images, and URLs as inputs
- Outputs LLM-optimized Markdown, structured JSON with schema support, HTML, and CSV
- Local mode requires CUDA for GPU acceleration; CPU fallback is mentioned but not detailed
- Built-in web interface runs on localhost:8000 with pip install "docstrange[web]"
- MCP server integration for Claude Desktop document navigation
- Models download automatically on first local run
Caveats
- The README claims “works on GPU or CPU when running locally” but the local processing section only documents gpu=True and notes CUDA is required; CPU behavior is unclear
- Cloud mode is default, so privacy requires explicit opt-in to local mode
- “7B model” is referenced but not named or characterized beyond parameter count
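Given the first two caveats, one cautious pattern is to fail closed: check for an NVIDIA driver before asking for local mode, and stop rather than fall through to the cloud default. The sketch below uses only the Python standard library. The function names are hypothetical, and finding nvidia-smi on the PATH is a rough heuristic, not proof that CUDA works on the machine.

```python
import shutil
from typing import Callable, Optional

# Type of a lookup function compatible with shutil.which.
Which = Callable[[str], Optional[str]]


def has_nvidia_driver(which: Which = shutil.which) -> bool:
    """Heuristic: a discoverable nvidia-smi binary suggests an NVIDIA driver is installed."""
    return which("nvidia-smi") is not None


def local_mode_options(which: Which = shutil.which) -> dict:
    """Return options for local processing, or raise instead of using the cloud."""
    if not has_nvidia_driver(which):
        raise RuntimeError(
            "No NVIDIA driver found; refusing to continue so that "
            "documents are not sent to the default cloud mode."
        )
    # gpu=True is the flag the README documents for local processing.
    return {"gpu": True}
```

The which parameter exists so the check can be exercised without a GPU, for example by passing a stub that always returns None.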
Verdict
Worth a look if you’re building RAG pipelines or data extraction workflows and need an escape hatch from cloud-only APIs. Skip it if you need transparent model provenance or guaranteed CPU-only local operation.