Ucas-HaoranWei/GOT-OCR2.0

One model to OCR them all: plain text, formulas, and fine-grained regions

A unified vision-language model that treats OCR as generation rather than pipeline engineering.

★8.1k stars Python Computer Vision

View on GitHub ↗

Velocity · 7d

+13

★ / day

Trend

→steady

star history

What it does

GOT-OCR2.0 is a single vision-language model built on Qwen that reads images and generates text, formulas, and structured output directly. It handles plain OCR, formatted text (think LaTeX, tables), fine-grained region extraction by bounding box or color, and even multi-page documents via token-level processing. The project ships weights, training scripts for further post-training, and evaluation code against Fox and OneChart benchmarks.

The interesting bit

Instead of the classic OCR pipeline—detect, recognize, post-process—this treats everything as a sequence-to-sequence problem. The model renders formatted output to HTML so you can actually see if it reconstructed the layout correctly. There’s also a whole ecosystem of community ports: ONNX, MNN, llama.cpp, vLLM, OpenVINO, even a CPU-only fork.

Key highlights

Supports plain OCR, formatted OCR, fine-grained box/color extraction, and multi-crop/multi-page modes
Weights available via Hugging Face, Google Drive, and BaiduYun; Hugging Face downloads exceeded 1M as of December 2024
Fine-tuning via ms-swift with LoRA, or full post-training (stages 2–3) with DeepSpeed on provided weights
Now merged into Hugging Face Transformers with batched inference support (February 2025)
Training from scratch requires a separate repo (Vary-tiny-600k); this codebase only resumes from released weights

Caveats

Multi-page OCR is explicitly not batch inference; it works token-level and the README warns you to read the paper first or misuse it
Stage-1 pretraining isn’t included here—you’ll need the Vary-tiny-600k repo if you want to train from scratch rather than fine-tune

Verdict

Worth a look if you’re tired of gluing together detection, recognition, and layout analysis pipelines. Skip it if you need a lightweight, edge-first OCR solution; this is a 7B-parameter vision-language model that expects CUDA and Flash Attention.