bytedance/Dolphin

ByteDance's 3B-parameter VLM that reads documents like a human does

A single model classifies the document type, analyzes layout, then parses elements in parallel, with no pipeline of separate tools required.

9k stars · Python · Computer Vision · Data Tooling
Velocity (7d): +23 ★ / day · Trend: steady

What it does

Dolphin-v2 turns document images and PDFs into structured output: JSON, Markdown, or individual parsed elements like tables, formulas, and code blocks. It handles both clean digital documents and messy photographed pages through a single vision-language model, rather than chaining together separate OCR, layout, and parsing tools.
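
To make the output formats concrete, here is a minimal sketch of how one list of parsed elements could be serialized as either JSON or Markdown. The element dictionaries, their field names, and the helper functions `to_markdown` and `to_json` are all hypothetical illustrations, not Dolphin's actual schema or API.

```python
import json

# Hypothetical parsed elements; field names are illustrative, not Dolphin's schema.
elements = [
    {"type": "heading", "text": "Results"},
    {"type": "text", "text": "Accuracy improved."},
    {"type": "formula", "text": "E = mc^2"},
    {"type": "code", "text": "print('hi')"},
]

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


def to_markdown(items):
    """Render elements in order, choosing Markdown syntax by element type."""
    parts = []
    for item in items:
        kind, text = item["type"], item["text"]
        if kind == "heading":
            parts.append("## " + text)
        elif kind == "formula":
            parts.append("$$" + text + "$$")
        elif kind == "code":
            parts.append(FENCE + "\n" + text + "\n" + FENCE)
        else:
            parts.append(text)  # plain text passes through unchanged
    return "\n\n".join(parts)


def to_json(items):
    """Serialize the same elements as structured JSON."""
    return json.dumps(items, indent=2)
```

The point of the sketch is that both formats are views over one element list, so a downstream consumer can pick whichever suits it.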

The interesting bit

The two-stage architecture is the practical hook. Stage 1 classifies the document type and predicts reading order; Stage 2 then picks a strategy: holistic parsing for photographed documents, parallel element-wise parsing for digital ones. “Heterogeneous anchor prompting” essentially means giving each element type (tables, formulas, text) its own prompt template, which the README claims improves accuracy without bloating the architecture.
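
The control flow described above can be sketched roughly as follows. This is an assumption-laden illustration, not Dolphin's code: the prompt strings, the `Element` and `Layout` classes, and the `run_model` callable are all hypothetical stand-ins for whatever the real model interface looks like.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical per-element prompt templates (not Dolphin's real prompts).
ANCHOR_PROMPTS = {
    "table": "Parse the table in this region.",
    "formula": "Transcribe the formula in this region as LaTeX.",
    "code": "Transcribe the code block in this region.",
    "text": "Read the text in this region.",
}


@dataclass
class Element:
    kind: str    # element type from Stage 1, e.g. "table"
    order: int   # reading-order index from Stage 1
    region: str  # stand-in for a cropped image region


@dataclass
class Layout:
    doc_type: str            # "digital" or "photographed"
    elements: List[Element]


def prompt_for(element: Element) -> str:
    """Pick the prompt template for an element, falling back to plain text."""
    return ANCHOR_PROMPTS.get(element.kind, ANCHOR_PROMPTS["text"])


def parse_document(layout: Layout, run_model: Callable[[str, str], str]) -> List[str]:
    """Stage 2: choose a strategy based on the Stage 1 classification."""
    if layout.doc_type == "photographed":
        # Holistic parsing: one pass over the whole page.
        return [run_model("Parse the whole page.", "full_page")]
    # Element-wise parsing: one prompt per element, in reading order.
    ordered = sorted(layout.elements, key=lambda e: e.order)
    return [run_model(prompt_for(e), e.region) for e in ordered]
```

The sketch issues the element calls one after another for clarity; in the real system these are the calls that would be decoded in parallel.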

Key highlights

  • Single VLM handles classification, layout analysis, and parsing, with no external OCR or table extractors
  • 3B parameters for v2 (up from 0.3B in v1.5), with benchmarks showing improvement across text edit distance, formula CDM, and table TEDS scores on OmniDocBench
  • Parallel element decoding with configurable batch size for throughput tuning
  • Supports vLLM and TensorRT-LLM for accelerated inference
  • Hugging Face Transformers integration for standard model loading
  • Multi-page PDF parsing available since June 2025
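
As an illustration of the batch-size knob mentioned in the highlights, the sketch below chunks element crops into fixed-size batches before decoding. The function names, the `decode_batch` callable, and the default of 8 are assumptions made for the example; Dolphin's actual option name and default may differ.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def decode_in_batches(
    crops: Sequence[str],
    decode_batch: Callable[[Sequence[str]], List[str]],  # hypothetical model call
    batch_size: int = 8,  # assumed default, for illustration only
) -> List[str]:
    """Decode all crops, one batch at a time, preserving input order."""
    results: List[str] = []
    for batch in batched(crops, batch_size):
        results.extend(decode_batch(batch))
    return results
```

A larger batch size generally trades memory for throughput, which is why exposing it as a tunable is useful on GPUs of different sizes.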

Caveats

  • The demo link (http://115.190.42.15:8888/dolphin/) is HTTP and may be unreliable or region-blocked
  • “Call for Bad Cases” notice suggests the model still has visible failure modes the authors are cataloging
  • Changelog entries carry 2025 dates that appear to lie in the future (e.g., “2025.12.12”), likely a typo or non-standard dating, so the actual release timeline is unclear

Verdict

Worth evaluating if you’re currently maintaining a fragile pipeline of Tesseract + layout model + table extractor. Skip if you need guaranteed deterministic output or work in a domain with strict formatting requirements the model hasn’t seen; the authors are explicitly collecting those “bad cases” for a reason.
