deepseek-ai/DeepSeek-OCR

DeepSeek's OCR model treats vision as a compression problem

An LLM-centric vision encoder that squeezes documents into surprisingly few tokens, then lets the language model do the actual reading.

Velocity (7d): +99 ★/day · Trend: steady

What it does DeepSeek-OCR is a multimodal model that converts images and PDFs into structured text: markdown, raw OCR output, or detailed descriptions. It runs via vLLM or Transformers, supports input resolutions up to 1280×1280, and offers a quirky “Gundam” dynamic-resolution mode that tiles multiple 640×640 patches alongside a single 1024×1024 anchor view. The README also announces a successor, DeepSeek-OCR2, with a January 2026 release date.
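The resolution modes can be turned into rough token arithmetic. The sketch below assumes one vision token per 64×64 pixel block, which is simply the ratio that reproduces the two budgets quoted in the highlights (64 tokens at 512×512, 256 at 1024×1024); it is an inference from those figures, not a documented formula. The helper names `vision_tokens` and `gundam_tokens` are hypothetical, and treating a Gundam budget as the plain sum of its tiles and anchor is also an assumption.

```python
def vision_tokens(side_px: int, px_per_token_side: int = 64) -> int:
    """Estimate vision tokens for a square image.

    ASSUMPTION: one token per 64x64 pixel block, inferred from the
    quoted figures (512x512 -> 64 tokens, 1024x1024 -> 256 tokens).
    """
    if side_px <= 0 or side_px % px_per_token_side != 0:
        raise ValueError("side must be a positive multiple of the block size")
    return (side_px // px_per_token_side) ** 2


def gundam_tokens(n_tiles: int) -> int:
    """Estimate a Gundam-mode budget: n tiles of 640x640 plus one 1024x1024 anchor.

    ASSUMPTION: tile and anchor tokens are simply added together.
    """
    if n_tiles < 0:
        raise ValueError("n_tiles must be non-negative")
    return n_tiles * vision_tokens(640) + vision_tokens(1024)


print(vision_tokens(512))    # 64, matches the quoted budget
print(vision_tokens(1024))   # 256, matches the quoted budget
print(gundam_tokens(4))      # 656 under the assumptions above
```

If the estimate holds, a four-tile Gundam page costs a few hundred tokens, far less than the same page would need as plain text in a dense layout.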

The interesting bit The framing is the twist: instead of treating vision encoding as a separate art, DeepSeek-OCR investigates it from an “LLM-centric viewpoint,” essentially asking how far visual information can be compressed before the language model loses comprehension. The custom NGramPerReqLogitsProcessor in vLLM, which whitelists the token IDs for tags like <td>, suggests tight coupling between decoding constraints and document structure.
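To make the whitelist idea concrete, here is a minimal pure-Python sketch of an n-gram repetition filter with exempt tokens. It is not vLLM's NGramPerReqLogitsProcessor; the function names, the `ngram_size` and `window_size` parameters, and the token IDs are all hypothetical. The presumed motivation is that table markup legitimately repeats tags such as <td>, so a naive repetition ban would corrupt tables unless those tokens are exempt.

```python
import math


def banned_next_tokens(history, ngram_size, window_size, whitelist):
    """Return token IDs that would complete an n-gram already present
    in the last `window_size` tokens, excluding whitelisted IDs."""
    if ngram_size < 1 or len(history) < ngram_size:
        return set()
    recent = history[-window_size:]
    # The (n-1)-token prefix that the next token would extend.
    prefix = tuple(history[-(ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for i in range(len(recent) - ngram_size + 1):
        gram = recent[i:i + ngram_size]
        if tuple(gram[:-1]) == prefix:
            banned.add(gram[-1])
    return banned - set(whitelist)


def apply_ban(logits, banned):
    """Set banned token logits to -inf so they can never be sampled."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


# Hypothetical IDs: 7 = "<td>", 8 = "</td>", 5 = some cell text.
history = [7, 5, 8, 7, 5]
print(banned_next_tokens(history, 3, 10, whitelist=set()))   # {8}
print(banned_next_tokens(history, 3, 10, whitelist={7, 8}))  # set()
```

Without the whitelist, the closing tag is blocked as soon as a cell repeats; with it, the table can continue while ordinary text is still protected from loops.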

Key highlights

  • Token budgets are aggressively low: 64 tokens for 512×512, 256 for 1024×1024
  • PDF inference hits ~2500 tokens/s on an A100-40GB via vLLM
  • Supports batched benchmark evaluation and streaming image output
  • Multiple prompt modes: grounding-aware markdown conversion, layout-free OCR, figure parsing, even reference location (<|ref|>text<|/ref|>)
  • Upstream vLLM integration landed in October 2025, so no custom fork is needed

Caveats

  • Setup is finicky: pinned to CUDA 11.8, torch 2.6.0, and a specific vLLM 0.8.5 wheel; flash-attn must be built from source
  • The README’s “2026/01/27” release date for OCR2 is either a typo or time travel
  • “Gundam” dynamic resolution is supported but unexplained—no docs on when to use it or why it’s named after a mecha

Verdict Worth a look if you’re building document pipelines and want to trade vision-encoder complexity for LLM inference efficiency. Skip if you need battle-tested OCR without the PyTorch dependency headache.
