DeepSeek's OCR model treats vision as a compression problem
An LLM-centric vision encoder that squeezes documents into surprisingly few tokens, then lets the language model do the actual reading.

What it does
DeepSeek-OCR is a multimodal model that converts images and PDFs into structured text: markdown, raw OCR, or detailed descriptions. It runs via vLLM or Transformers, supports resolutions up to 1280×1280, and offers a quirky “Gundam” dynamic-resolution mode that tiles multiple 640×640 patches alongside a single 1024×1024 anchor. The project also ships with a successor, DeepSeek-OCR2, released January 2026.
The interesting bit
The framing is the twist: instead of treating vision encoding as a separate art, DeepSeek-OCR investigates it from an “LLM-centric viewpoint,” essentially asking how far visual information can be compressed before the language model can no longer read it. The custom NGramPerReqLogitsProcessor in vLLM, which whitelists the token IDs for specific tokens such as <td>, suggests tight coupling between decoding constraints and document structure.
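To make the decoding constraint concrete, here is a minimal sketch of a no-repeat n-gram filter with a whitelist. It is not the NGramPerReqLogitsProcessor source; the function names, the window parameter, and the token IDs in the usage note are hypothetical, and the real processor may work differently. The idea it illustrates: ban any token that would repeat a recent n-gram, but exempt tokens such as table tags that legitimately repeat in structured output.

```python
from typing import Iterable, List, Set


def banned_next_tokens(
    history: List[int],
    ngram_size: int,
    window_size: int,
    whitelist: Iterable[int] = (),
) -> Set[int]:
    """Return token IDs that would complete an n-gram already seen in the window.

    Hypothetical helper for illustration only.
    """
    if ngram_size < 1 or len(history) < ngram_size - 1:
        return set()
    recent = history[-window_size:]
    # The (n-1)-token prefix that the next token would extend.
    prefix = tuple(recent[len(recent) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned: Set[int] = set()
    for i in range(len(recent) - ngram_size + 1):
        ngram = recent[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    # Whitelisted tokens (e.g. table tags) are never banned.
    return banned - set(whitelist)


def apply_ban(logits: List[float], banned: Set[int]) -> List[float]:
    """Set the logits of banned token IDs to negative infinity."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]
```

With a history of `[5, 6, 7, 5, 6]` and `ngram_size=3`, token 7 is banned because `5, 6, 7` has already appeared; passing `whitelist={7}` lifts the ban. Repeating a table-cell tag is expected in a document with many cells, which is presumably why such tokens are exempted.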
Key highlights
- Token budgets are aggressively low: 64 tokens for 512×512, 256 for 1024×1024
- PDF inference hits ~2500 tokens/s on an A100-40GB via vLLM
- Supports batched benchmark evaluation and streaming image output
- Multiple prompt modes: grounding-aware markdown conversion, layout-free OCR, figure parsing, even reference location (<|ref|>text<|/ref|>)
- Upstream vLLM integration landed October 2025, no custom forks needed
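The token budgets quoted in the highlights follow a simple pattern, sketched below. The patch size of 16 pixels and the 16× token reduction are assumptions chosen because they reproduce the two figures in the list (64 and 256); they are not taken from this write-up, and the Gundam formula is an extrapolation from the "n tiles plus one anchor" description.

```python
PATCH_SIZE = 16   # assumed patch edge in pixels
COMPRESSION = 16  # assumed token reduction factor applied after patching


def vision_tokens(side: int) -> int:
    """Vision tokens for a square image of the given side length."""
    patches = (side // PATCH_SIZE) ** 2
    return patches // COMPRESSION


def gundam_tokens(num_tiles: int) -> int:
    """Tokens for num_tiles 640x640 tiles plus one 1024x1024 anchor view."""
    return num_tiles * vision_tokens(640) + vision_tokens(1024)


if __name__ == "__main__":
    for side in (512, 1024):
        print(f"{side}x{side}: {vision_tokens(side)} tokens")
```

Under these assumptions, the budget grows with the square of the image side, so halving the resolution cuts the token count by four.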
Caveats
- Setup is finicky: pinned to CUDA 11.8, torch 2.6.0, and a specific vLLM 0.8.5 wheel; flash-attn must be built from source
- The README’s “2026/01/27” release date for OCR2 is either a typo or time travel
- “Gundam” dynamic resolution is supported but unexplained—no docs on when to use it or why it’s named after a mecha
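Given the version pins in the first caveat, a small preflight check can save a failed install. The sketch below is not part of the project; the package names `torch` and `vllm` and the helper names are assumptions, and it does not check the CUDA toolkit version.

```python
from importlib import metadata
from typing import Dict, List, Optional

# Pins quoted in the caveats above; package names are assumed.
PINS = {"torch": "2.6.0", "vllm": "0.8.5"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    installed: Dict[str, Optional[str]], pins: Dict[str, str]
) -> List[str]:
    """Describe every package whose installed version differs from its pin."""
    problems = []
    for name, wanted in pins.items():
        found = installed.get(name)
        # Strip local build tags such as "+cu118" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems.append(
                f"{name}: expected {wanted}, found {found or 'not installed'}"
            )
    return problems


if __name__ == "__main__":
    current = {name: installed_version(name) for name in PINS}
    for line in find_mismatches(current, PINS):
        print(line)
```

Keeping the comparison in a pure function makes it easy to test without installing either package.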
Verdict
Worth a look if you’re building document pipelines and want to trade vision-encoder complexity for LLM inference efficiency. Skip if you need battle-tested OCR without the PyTorch dependency headache.