ByteDance's 3B-parameter VLM that reads documents like a human does
A single model that classifies document type, analyzes layout, then parses elements in parallel—no pipeline of separate tools required.

What it does
Dolphin-v2 turns document images and PDFs into structured output: JSON, Markdown, or individual parsed elements like tables, formulas, and code blocks. It handles both clean digital documents and messy photographed pages through a single vision-language model, rather than chaining together separate OCR, layout, and parsing tools.
The interesting bit
The two-stage architecture is the practical hook. Stage 1 classifies the document type and predicts reading order; Stage 2 then chooses its strategy: holistic parsing for photographed documents, parallel element-wise parsing for digital ones. “Heterogeneous anchor prompting” essentially means giving each element type (tables, formulas, text) its own prompt template, which the README claims improves accuracy without bloating the architecture.
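To make the dispatch concrete, here is a minimal sketch of how Stage 1 output could drive Stage 2. It is not Dolphin's actual code: the prompt strings, the `Element` class, the `"photographed"` label, and the fallback to the text prompt are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical prompt templates, one per element type. The real prompt
# wording used by Dolphin-v2 is not given in this summary.
ANCHOR_PROMPTS = {
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
    "code": "Read the code in the image.",
    "text": "Read the text in the image.",
}

# Hypothetical prompt for the single whole-page pass.
HOLISTIC_PROMPT = "Parse the whole page."


@dataclass
class Element:
    kind: str           # element type predicted in Stage 1
    reading_order: int  # position in the predicted reading order


def prompt_for(element):
    # Assumption: unknown element types fall back to the text prompt.
    return ANCHOR_PROMPTS.get(element.kind, ANCHOR_PROMPTS["text"])


def plan_stage2(doc_type, elements):
    """Return the (reading_order, prompt) jobs Stage 2 would run."""
    if doc_type == "photographed":
        # Holistic parsing: one job covering the full page.
        return [(0, HOLISTIC_PROMPT)]
    # Element-wise parsing: one job per element, in reading order.
    ordered = sorted(elements, key=lambda e: e.reading_order)
    return [(e.reading_order, prompt_for(e)) for e in ordered]
```

Under these assumptions, `plan_stage2("digital", elements)` yields one independent job per element, which is what makes the element-wise path easy to batch, while a photographed page collapses to a single job.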
Key highlights
- Single VLM handles classification, layout analysis, and parsing—no external OCR or table extractors
- 3B parameters for v2 (up from 0.3B in v1.5), with benchmarks showing improvement across text edit distance, formula CDM, and table TEDS scores on OmniDocBench
- Parallel element decoding with configurable batch size for throughput tuning
- Supports vLLM and TensorRT-LLM for accelerated inference
- Hugging Face Transformers integration for standard model loading
- Multi-page PDF parsing available since June 2025
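The batch-size knob from the highlights can be pictured with a small sketch. This assumes a hypothetical `parse_batch` callable that stands in for one batched model call; neither that name nor the default of 4 comes from the Dolphin repository.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def parse_elements(crops, parse_batch, batch_size=4):
    """Parse element crops in batches, preserving input order.

    `parse_batch` is a hypothetical stand-in for one batched model call:
    it takes a list of crops and returns one result per crop.
    """
    results = []
    for batch in batched(crops, batch_size):
        results.extend(parse_batch(batch))
    return results
```

A larger `batch_size` means fewer model calls per page at the cost of more memory per call, which is the usual throughput trade-off this kind of setting exposes.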
Caveats
- The demo link (http://115.190.42.15:8888/dolphin/) is served over plain HTTP from a raw IP address, so it may be unreliable or region-blocked
- “Call for Bad Cases” notice suggests the model still has visible failure modes the authors are cataloging
- Changelog dates use a dotted format (e.g., “2025.12.12”) that reads most naturally as YYYY.MM.DD; beyond those entries, the exact release timeline is unclear
Verdict
Worth evaluating if you’re currently maintaining a fragile pipeline of Tesseract + layout model + table extractor. Skip if you need guaranteed deterministic output or work in a domain with strict formatting requirements the model hasn’t seen—those “bad cases” are explicitly being collected for a reason.