One model to OCR them all: plain text, formulas, and fine-grained regions
A unified vision-language model that treats OCR as generation rather than pipeline engineering.

What it does
GOT-OCR2.0 is a single vision-language model built on Qwen that reads images and generates text, formulas, and structured output directly. It handles plain OCR, formatted text (think LaTeX, tables), fine-grained region extraction by bounding box or color, and even multi-page documents via token-level processing. The project ships weights, training scripts for further post-training, and evaluation code against Fox and OneChart benchmarks.
The interesting bit
Instead of the classic OCR pipeline—detect, recognize, post-process—this treats everything as a sequence-to-sequence problem. The model renders formatted output to HTML so you can actually see if it reconstructed the layout correctly. There’s also a whole ecosystem of community ports: ONNX, MNN, llama.cpp, vLLM, OpenVINO, even a CPU-only fork.
Key highlights
- Supports plain OCR, formatted OCR, fine-grained box/color extraction, and multi-crop/multi-page modes
- Weights available via Hugging Face, Google Drive, and BaiduYun; Hugging Face downloads exceeded 1M as of December 2024
- Fine-tuning via ms-swift with LoRA, or full post-training (stages 2–3) with DeepSpeed on provided weights
- Now merged into Hugging Face Transformers with batched inference support (February 2025)
- Training from scratch requires a separate repo (Vary-tiny-600k); this codebase only resumes from released weights
Caveats
- Multi-page OCR is explicitly not batch inference; it works token-level and the README warns you to read the paper first or misuse it
- Stage-1 pretraining isn’t included here—you’ll need the Vary-tiny-600k repo if you want to train from scratch rather than fine-tune
Verdict
Worth a look if you’re tired of gluing together detection, recognition, and layout analysis pipelines. Skip it if you need a lightweight, edge-first OCR solution; this is a 7B-parameter vision-language model that expects CUDA and Flash Attention.