datalab-to/surya

A 650M-parameter sun god for your PDFs

Surya does OCR, layout analysis, reading order, and table recognition in 90+ languages from a single VLM.

20.7k stars · Python · Computer Vision
Velocity (7d): +24 ★/day · Trend: steady

What it does Surya is a document-intelligence toolkit that runs OCR, layout detection, reading-order recovery, and table recognition through one 650M-parameter vision-language model. It reads images and PDFs, then emits structured JSON with per-block text, HTML, polygons, and confidence scores. A separate inference manager auto-spawns either a vLLM server (NVIDIA) or llama.cpp (CPU/Apple Silicon), so you don’t have to babysit the backend.
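As a rough illustration of consuming that per-block JSON, here is a minimal sketch that drops low-confidence blocks and groups the rest by layout label. The sample payload and its key names ("label", "text", "html", "polygon", "confidence") are assumptions made for the example, not Surya's documented schema; check the README for the real field names.

```python
import json
from collections import defaultdict

# Hypothetical sample payload. The key names are assumptions for
# illustration only; they are not taken from Surya's documentation.
SAMPLE = """
{"blocks": [
  {"label": "Text", "text": "Quarterly report", "html": "<p>Quarterly report</p>",
   "polygon": [[10, 10], [200, 10], [200, 40], [10, 40]], "confidence": 0.98},
  {"label": "Table", "text": "", "html": "<table><tr><td>1</td></tr></table>",
   "polygon": [[10, 60], [200, 60], [200, 160], [10, 160]], "confidence": 0.91},
  {"label": "Text", "text": "smudged footer", "html": "<p>smudged footer</p>",
   "polygon": [[10, 300], [200, 300], [200, 320], [10, 320]], "confidence": 0.42}
]}
"""


def confident_blocks_by_label(page, min_confidence=0.8):
    """Group blocks by label, keeping only those at or above the threshold."""
    grouped = defaultdict(list)
    for block in page.get("blocks", []):
        if block.get("confidence", 0.0) >= min_confidence:
            grouped[block.get("label", "Unknown")].append(block)
    return dict(grouped)


page = json.loads(SAMPLE)
grouped = confident_blocks_by_label(page)
for label, blocks in grouped.items():
    print(label, len(blocks))
```

The 0.8 threshold is arbitrary; a real pipeline would tune it against its own documents.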

The interesting bit The whole pipeline (layout, OCR, tables) shares a single VLM instead of chaining specialist models. That keeps the parameter count at 650M, well under 3B, while still scoring 83.3% on olmOCR-bench. The trade-off is server lifecycle: by default the inference server starts and shuts down with every CLI invocation, so the README explicitly warns you to pass --keep_server or set an env var if you’re processing more than one file.
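To make the multi-file case concrete, here is a small sketch that builds one command line per input file with --keep_server appended. It only prints the commands and runs nothing. The tool name surya_ocr and the --keep_server flag come from the summary above; passing the input path as the first positional argument, and the flag taking no value, are assumptions to verify against the README.

```python
import shlex


def build_ocr_commands(paths, keep_server=True):
    """Return one argv list per input file; nothing is executed here."""
    commands = []
    for path in paths:
        # Assumption: the input path is the first positional argument.
        argv = ["surya_ocr", str(path)]
        if keep_server:
            # Assumption: the flag is a bare switch with no value.
            argv.append("--keep_server")
        commands.append(argv)
    return commands


# Print shell-quoted commands for review before running them by hand.
for argv in build_ocr_commands(["invoice.pdf", "scan 02.png"]):
    print(shlex.join(argv))
```

Building argv lists instead of shell strings keeps file names with spaces safe if the commands are later handed to subprocess.run.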

Key highlights

  • 650M params, 5 pages/s on an RTX 5090, 87.2% on an internal 91-language benchmark
  • Output blocks carry canonical layout labels (Text, Equation, Table, etc.), reading order, and HTML snippets
  • Block mode lets you pre-run layout, then OCR only the text regions; the default full-page mode makes one VLM call per page
  • Ships with a Streamlit GUI (surya_gui) and granular CLI tools (surya_ocr, surya_layout, surya_table, surya_detect)
  • Apache 2.0 code; model weights are OpenRAIL-M (free for research, personal use, and sub-$5M startups)

Caveats

  • Model weights are not fully open for commercial use; broader licensing requires a paid deal
  • v1 to v2 migration breaks schemas: text_lines became blocks, layout dropped top_k, table cells lost is_header/colspan/rowspan
  • Throughput depends on the backend and on tuning: DPI, batch size, and MTP all matter
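For code that has to read both old and new output during the v1 to v2 migration, a small compatibility shim is one option. The sketch below uses the key names listed in the caveat above (text_lines, blocks, is_header, colspan, rowspan). Where those keys sit in the real JSON, and the fallback defaults chosen here, are assumptions.

```python
def text_units(page):
    """Return the page's text entries under either the v2 or the v1 key."""
    if "blocks" in page:  # v2 name
        return page["blocks"]
    return page.get("text_lines", [])  # v1 name, or empty if neither exists


def cell_layout(cell):
    """Read the v1-only table cell fields, with fallbacks for v2 cells.

    The defaults (not a header, span of 1) are assumptions chosen for this
    sketch; v2 output simply does not carry these fields.
    """
    return {
        "is_header": cell.get("is_header", False),
        "colspan": cell.get("colspan", 1),
        "rowspan": cell.get("rowspan", 1),
    }


v1_page = {"text_lines": [{"text": "old style"}]}
v2_page = {"blocks": [{"text": "new style"}]}
print(text_units(v1_page)[0]["text"], "|", text_units(v2_page)[0]["text"])
```

A shim like this only hides the renamed key. Header and span information removed in v2 cannot be recovered this way.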

Verdict Worth a look if you need multilingual document parsing with structure and don’t want to wire together four separate models. Skip it if you need unrestricted commercial model weights or real-time streaming OCR.
