A 650M-parameter sun god for your PDFs
Surya does OCR, layout analysis, reading order, and table recognition in 90+ languages from a single VLM.

What it does
Surya is a document-intelligence toolkit that runs OCR, layout detection, reading-order recovery, and table recognition through one 650M-parameter vision-language model. It reads images and PDFs, then emits structured JSON with per-block text, HTML, polygons, and confidence scores. A separate inference manager auto-spawns either a vLLM server (NVIDIA) or llama.cpp (CPU/Apple Silicon) so you don’t have to baby-sit the backend.
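To make the output shape concrete, here is a minimal sketch of filtering blocks by confidence and collapsing polygons into bounding boxes. The key names (blocks, label, text, html, confidence, polygon) and the sample record are assumptions made for illustration, not Surya's documented schema, so check the real JSON before relying on them.

```python
import json

# Hypothetical sample shaped like the output described above; the key names
# are assumptions, not Surya's documented schema.
SAMPLE = """
{
  "blocks": [
    {"label": "Text", "text": "Quarterly report", "confidence": 0.98,
     "polygon": [[10, 10], [200, 10], [200, 40], [10, 40]]},
    {"label": "Table", "text": "", "html": "<table></table>", "confidence": 0.91,
     "polygon": [[10, 60], [400, 60], [400, 300], [10, 300]]},
    {"label": "Text", "text": "smudged footer", "confidence": 0.42,
     "polygon": [[10, 320], [200, 320], [200, 340], [10, 340]]}
  ]
}
"""


def confident_blocks(page, min_confidence=0.8):
    """Return blocks with confidence >= min_confidence, in their original order."""
    return [
        b for b in page.get("blocks", [])
        if b.get("confidence", 0.0) >= min_confidence
    ]


def bounding_box(polygon):
    """Collapse a polygon (list of [x, y] points) into (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


page = json.loads(SAMPLE)
for block in confident_blocks(page):
    print(block["label"], bounding_box(block["polygon"]))
```

With the sample above, the low-confidence footer is dropped and the two remaining blocks are printed with their boxes.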
The interesting bit
The whole pipeline (layout, OCR, tables) shares a single VLM instead of chaining specialist models. That keeps the parameter count modest (650M, well under 3B) while still scoring 83.3% on olmOCR-bench. The trade-off is server lifecycle: by default the inference server spawns and dies with every CLI invocation, so the README explicitly warns you to pass --keep_server or set an env var if you’re processing more than one file.
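A back-of-the-envelope sketch shows why the lifecycle default matters for batches. The function and the timings below (30 s to load the model, 2 s of work per file) are made-up illustrative assumptions, not measurements of Surya.

```python
def total_seconds(n_files, startup_s, per_file_s, keep_server):
    """Estimate wall-clock time for n_files under the two server lifecycles."""
    if n_files == 0:
        return 0.0
    # With a kept server the startup cost is paid once; otherwise once per file.
    startups = 1 if keep_server else n_files
    return startups * startup_s + n_files * per_file_s


# Illustrative numbers only: 30 s to load the model, 2 s of work per file.
cold = total_seconds(100, startup_s=30.0, per_file_s=2.0, keep_server=False)
warm = total_seconds(100, startup_s=30.0, per_file_s=2.0, keep_server=True)
print(cold, warm)  # 3200.0 230.0
```

Under these assumed numbers, almost all of the cold-start total is model loading, which is the cost a persistent server avoids. For a single file the two modes cost the same.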
Key highlights
- 650M params, 5 pages/s on an RTX 5090, 87.2% on an internal 91-language benchmark
- Output blocks carry canonical layout labels (Text, Equation, Table, etc.), reading order, and HTML snippets
- Block mode lets you pre-run layout, then OCR only text regions; default full-page mode makes one VLM call per page
- Ships with a Streamlit GUI (surya_gui) and granular CLI tools (surya_ocr, surya_layout, surya_table, surya_detect)
- Apache 2.0 code; model weights are OpenRAIL-M (free for research, personal use, and sub-$5M startups)
Caveats
- Model weights are not fully open for commercial use; broader licensing requires a paid deal
- v1 to v2 migration breaks schemas: text_lines became blocks, layout dropped top_k, table cells lost is_header/colspan/rowspan
- Throughput is backend-dependent and tuning-heavy: DPI, batch size, and MTP all matter
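If downstream code has to read both old and new results during a migration, a small compatibility shim is one option. The sketch below relies only on the renames and removals listed above; the helper names are hypothetical, and where these keys sit inside the real result objects is an assumption.

```python
def page_blocks(page):
    """Return a page's text blocks whether the result uses the v2 or the v1 key."""
    if "blocks" in page:        # v2 name, per the rename noted above
        return page["blocks"]
    if "text_lines" in page:    # v1 name
        return page["text_lines"]
    raise KeyError("page has neither 'blocks' nor 'text_lines'")


def cell_span(cell):
    """Read colspan/rowspan defensively; v2 cells no longer carry them, so default to 1."""
    return cell.get("colspan", 1), cell.get("rowspan", 1)


print(page_blocks({"text_lines": ["old-style line"]}))
print(cell_span({}))
```

Defaulting spans to 1 keeps old table-rendering code from crashing, but it cannot recover merged cells that v2 no longer reports.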
Verdict
Worth a look if you need multilingual document parsing with structure and don’t want to wire together four separate models. Skip it if you need unrestricted commercial model weights or real-time streaming OCR.