deepseek-ai/DeepSeek-OCR-2

DeepSeek's OCR model learns to read like a human, not a scanner

A vision-language model that converts documents to markdown by compressing visual tokens through causal flow instead of brute-force grid encoding.

Velocity (7d): +22 ★ / day · Trend: steady

What it does

DeepSeek-OCR 2 takes images and PDFs and spits out structured markdown text. It runs via vLLM for streaming or batch inference, or through standard Transformers with flash attention. The model handles dynamic resolution by tiling each input into up to six 768×768 patches plus one 1024×1024 base image, which caps the visual token count at 6×144 + 256 = 1,120 per image.

The interesting bit

The “Visual Causal Flow” pitch suggests the model encodes documents in something closer to a human reading order than a flat CNN grid, though the README is thin on mechanics and points to the arXiv paper for the actual architecture. The prompt design is notably explicit: “<|grounding|>Convert the document to markdown” triggers layout-aware output, while “Free OCR” strips structure for plain text.
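The two prompt modes can be wrapped in a small helper. This is a minimal sketch, not code from the repository: the mode names "markdown" and "plain" and the function build_prompt are hypothetical, and the README may require extra tokens around the prompt (an image placeholder, for instance), so check it before relying on these strings.

```python
# Hypothetical helper: the mode names and function are illustrative only.
# The two prompt strings are the ones quoted in the text above.
GROUNDING_TAG = "<|grounding|>"

PROMPTS = {
    # Layout-aware output: the grounding tag asks for structured markdown.
    "markdown": f"{GROUNDING_TAG}Convert the document to markdown",
    # Plain text output: no grounding tag, no layout structure.
    "plain": "Free OCR",
}


def build_prompt(mode: str = "markdown") -> str:
    """Return the prompt text for the requested (hypothetical) mode name."""
    try:
        return PROMPTS[mode]
    except KeyError:
        valid = ", ".join(sorted(PROMPTS))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    for mode in PROMPTS:
        print(f"{mode}: {build_prompt(mode)}")
```

Keeping the strings in one table makes it harder to drop the grounding tag by accident, which is the only thing separating the two behaviours described above.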

Key highlights

  • Dual inference paths: vLLM for speed (streaming images, concurrent PDFs) and Transformers for direct Python integration
  • Dynamic resolution tiling: (0-6)×768² + 1×1024² mapped to (0-6)×144 + 256 visual tokens
  • Benchmarked against OmniDocBench v1.5
  • Requires a specific stack: CUDA 11.8, PyTorch 2.6.0 and the vLLM 0.8.5 wheel, plus flash-attn 2.7.3 for attention
  • trust_remote_code=True needed for both tokenizer and model loading
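The tiling arithmetic from the list above is easy to check by hand. The sketch below only restates the (0-6)×144 + 256 formula; the function and constant names are illustrative and are not part of the model's API.

```python
# Illustrative names; the numbers come from the tiling formula in the list above.
TILE_SIDE = 768        # each local tile is 768x768 pixels
BASE_SIDE = 1024       # the single base image is 1024x1024 pixels
TOKENS_PER_TILE = 144
TOKENS_BASE = 256
MAX_TILES = 6


def visual_token_count(num_tiles: int) -> int:
    """Visual tokens for one input: num_tiles local tiles plus the base image."""
    if not 0 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be between 0 and {MAX_TILES}")
    return num_tiles * TOKENS_PER_TILE + TOKENS_BASE


def pixels_per_token(side: int, tokens: int) -> int:
    """How many input pixels a single visual token covers."""
    return (side * side) // tokens


if __name__ == "__main__":
    for n in range(MAX_TILES + 1):
        print(f"{n} tiles -> {visual_token_count(n)} visual tokens")
    # Both views compress at the same rate: 4096 pixels (a 64x64 area) per token.
    print(pixels_per_token(TILE_SIDE, TOKENS_PER_TILE))
    print(pixels_per_token(BASE_SIDE, TOKENS_BASE))
```

The budget runs from 256 tokens with no tiles to 1,120 with all six, and both the tiles and the base image work out to one token per 64×64 pixel area.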

Caveats

  • README is essentially a setup script with prompts; no accuracy numbers, no comparison tables, no training details
  • vLLM and Transformers dependency versions are pinned tightly enough to potentially conflict in shared environments
  • Paper is on arXiv with a 2026 citation date, which may be a preprint placeholder
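Before installing into a shared environment, it can help to compare what is already there against the pinned versions listed earlier. This is a rough sketch under assumptions: the distribution names "torch", "vllm" and "flash-attn" are assumed to match what pip reports, the helper names are made up for illustration, and CUDA 11.8 is not checked because it is not a Python package.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# Pins as listed in the highlights; the distribution names are an assumption.
PINNED: Dict[str, str] = {
    "torch": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def base_version(version: str) -> str:
    """Drop a local build suffix, e.g. '2.6.0+cu118' -> '2.6.0'."""
    return version.split("+", 1)[0]


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up each distribution's installed version, or None if it is absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(
    installed: Dict[str, Optional[str]],
    pinned: Dict[str, str] = PINNED,
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each package that is missing or off-pin to (installed, wanted)."""
    mismatches: Dict[str, Tuple[Optional[str], str]] = {}
    for name, wanted in pinned.items():
        have = installed.get(name)
        if have is None or base_version(have) != wanted:
            mismatches[name] = (have, wanted)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(installed_versions(PINNED))
    for name, (have, wanted) in problems.items():
        print(f"{name}: installed {have or 'nothing'}, pinned {wanted}")
    if not problems:
        print("all pinned packages match")
```

Any mismatch in an environment that other projects depend on is a good argument for giving this model its own virtual environment or container.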

Verdict

Worth a look if you’re building document pipelines and need a markdown-native OCR with controllable layout output. Skip it if you want proven benchmarks, clean dependency trees, or explanations of what “Visual Causal Flow” actually means without reading the paper.
