DeepSeek's OCR model learns to read like a human, not a scanner
A vision-language model that converts documents to markdown by compressing visual tokens through causal flow instead of brute-force grid encoding.

What it does
DeepSeek-OCR 2 takes images and PDFs and spits out structured markdown text. It runs via vLLM for streaming or batch inference, or through standard Transformers with flash attention. The model handles dynamic resolution by tiling the input into up to six 768×768 patches plus one 1024×1024 base image, which caps the visual token count at 6×144 + 256 = 1,120 per image.
The interesting bit
The “Visual Causal Flow” pitch suggests the model encodes documents more like a human reading sequence than a flat CNN grid, though the README is thin on mechanics and points to the arXiv paper for the actual architecture. The prompt design is notably explicit: “<|grounding|>Convert the document to markdown” triggers layout-aware output, while “Free OCR” strips structure for plain text.
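To make the two prompt modes concrete, here is a minimal sketch of how a pipeline might switch between them. The OcrMode enum and build_prompt helper are hypothetical names made up for illustration; only the two prompt strings come from the README. The model's chat template may need extra wrapping around these strings, which this sketch does not attempt to reproduce.

```python
from enum import Enum


class OcrMode(Enum):
    """Hypothetical mode switch; not part of the DeepSeek-OCR 2 code."""
    LAYOUT_MARKDOWN = "layout_markdown"
    PLAIN_TEXT = "plain_text"


# Prompt strings as quoted from the README. Any additional wrapping the
# model expects (for example an image placeholder) is deliberately left out.
PROMPTS = {
    OcrMode.LAYOUT_MARKDOWN: "<|grounding|>Convert the document to markdown",
    OcrMode.PLAIN_TEXT: "Free OCR",
}


def build_prompt(mode: OcrMode) -> str:
    """Return the README prompt for the requested output style."""
    return PROMPTS[mode]


if __name__ == "__main__":
    for mode in OcrMode:
        print(f"{mode.name}: {build_prompt(mode)}")
```

Keeping the prompts in one table means a pipeline can expose "layout or plain text" as a single setting instead of scattering magic strings across call sites.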
Key highlights
- Dual inference paths: vLLM for speed (streaming images, concurrent PDFs) and Transformers for direct Python integration
- Dynamic resolution tiling: (0-6)×768² + 1×1024² mapped to (0-6)×144 + 256 visual tokens
- Benchmarked against OmniDocBench v1.5
- Requires a tightly pinned stack: CUDA 11.8, PyTorch 2.6.0, and the vLLM 0.8.5 wheel, plus flash-attn 2.7.3 for attention
- trust_remote_code=True needed for both tokenizer and model loading
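The tiling formula in the highlights is easy to sanity-check with a few lines of arithmetic. The sketch below uses only the numbers quoted above (0 to 6 tiles of 768×768 at 144 tokens each, plus one 1024×1024 base view at 256 tokens). The function names are made up for illustration, and the pixels-per-token figure assumes tokens are spread evenly over each view, which the README does not state.

```python
TILE_SIDE = 768         # pixels per side of each local tile
BASE_SIDE = 1024        # pixels per side of the base image
TOKENS_PER_TILE = 144   # visual tokens per 768x768 tile
TOKENS_BASE = 256       # visual tokens for the 1024x1024 base image
MAX_TILES = 6


def visual_token_count(num_tiles: int) -> int:
    """Total visual tokens for num_tiles local tiles plus the base image."""
    if not 0 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be in 0..{MAX_TILES}, got {num_tiles}")
    return num_tiles * TOKENS_PER_TILE + TOKENS_BASE


def pixels_per_token(side: int, tokens: int) -> float:
    """Average pixels covered by one token, assuming an even spread."""
    return side * side / tokens


if __name__ == "__main__":
    for n in range(MAX_TILES + 1):
        print(f"{n} tiles -> {visual_token_count(n)} visual tokens")
    print(pixels_per_token(TILE_SIDE, TOKENS_PER_TILE))
    print(pixels_per_token(BASE_SIDE, TOKENS_BASE))
```

The budget runs from 256 tokens (base image only) to 1,120 tokens (six tiles plus base). Under the even-spread assumption, both views work out to 4,096 pixels per token, the equivalent of one token per 64×64 pixel area.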
Caveats
- README is essentially a setup script with prompts; no accuracy numbers, no comparison tables, no training details
- vLLM and Transformers dependency versions are pinned tightly enough to potentially conflict in shared environments
- Paper is on arXiv with a 2026 citation date, which may be a preprint placeholder
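On the pinned-dependency caveat: before installing into a shared environment, it can help to check what is already there against the pins. The sketch below is one way to do that with the standard library. The distribution names "torch", "vllm", and "flash-attn" are assumptions about how the packages are registered with pip, and the helper names are hypothetical. Local build suffixes such as "+cu118" are stripped before comparing.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# Versions as listed in the highlights; distribution names are assumptions.
PINS = {"torch": "2.6.0", "vllm": "0.8.5", "flash-attn": "2.7.3"}


def base_version(version: str) -> str:
    """Drop a local build suffix, e.g. '2.6.0+cu118' -> '2.6.0'."""
    return version.split("+", 1)[0]


def find_mismatches(
    installed: Dict[str, Optional[str]], pins: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each package that is missing or off-pin to (found, wanted)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        if found is None or base_version(found) != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up installed versions; None means the package is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    problems = find_mismatches(installed_versions(PINS), PINS)
    for name, (found, wanted) in problems.items():
        print(f"{name}: found {found or 'not installed'}, expected {wanted}")
    if not problems:
        print("All pinned packages match.")
```

If anything is reported as off-pin, a dedicated virtual environment for the model is probably the least painful route.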
Verdict
Worth a look if you’re building document pipelines and need a markdown-native OCR with controllable layout output. Skip it if you want proven benchmarks, clean dependency trees, or explanations of what “Visual Causal Flow” actually means without reading the paper.