A 0.9B-parameter model that actually reads your messy documents
GLM-OCR squeezes document understanding into a sub-1B model with a layout-aware pipeline and enough deployment options to please any ops team.

What it does
GLM-OCR turns images and PDFs into structured text: tables, formulas, code blocks, seals, and all. It runs a two-stage pipeline. PP-DocLayout-V3 first chops the page into regions, then a tiny 0.9B vision-language model reads each region in parallel and spits out Markdown plus JSON layout metadata.
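To make the two-stage shape concrete, here is a minimal sketch of "detect regions, then read them in parallel." It is not GLM-OCR's code: `detect_layout`, `recognize_region`, and the `Region` type are hypothetical stand-ins for the layout model and the vision-language model, and the page is a plain dict rather than an image.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, asdict
import json


@dataclass
class Region:
    index: int   # reading order on the page
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in pixels


def detect_layout(page):
    """Stage 1 stand-in: a real system would run a layout model here."""
    return [Region(i, kind, bbox) for i, (kind, bbox) in enumerate(page["regions"])]


def recognize_region(page, region):
    """Stage 2 stand-in: a real system would crop the region and query the VLM."""
    return f"<{region.kind} {region.index} from {page['name']}>"


def parse_page(page, max_workers=4):
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map() yields results in input order, so reading order
        # survives even though regions are processed concurrently.
        texts = list(pool.map(lambda r: recognize_region(page, r), regions))
    markdown = "\n\n".join(texts)
    layout = [dict(asdict(r), text=t) for r, t in zip(regions, texts)]
    return markdown, json.dumps(layout)
```

Calling `parse_page({"name": "p1", "regions": [("text", (0, 0, 10, 10))]})` returns a Markdown string and a JSON string describing each region. Threads suit this shape when stage 2 is a network call to a model server, which is an assumption about deployment rather than something the project states.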
The interesting bit
The model itself is almost comically small for the task (0.5B language decoder, CogViT visual encoder, some token downsampling sleight-of-hand), yet the authors claim #1 on OmniDocBench V1.5 (94.62) and competitive results on formula and table benchmarks. The SDK is the real product: one pip install glmocr, a config flag to switch between cloud and self-hosted, and you’re parsing directories from the CLI or wrapping it in a Flask service.
Key highlights
- MaaS mode: cloud API wrapper, zero GPU, one YAML key
- Self-hosted: vLLM, SGLang, Ollama, MLX/Apple Silicon, or SDK server + GPU-less client
- Modular pipeline: PageLoader→PPDocLayoutDetector→OCRClient→ResultFormatter, swappable if you need custom preprocessing
- Speculative decoding: both vLLM and SGLang configs ship with MTP/NEXTN speculative tokens for latency reduction
- Fine-tuning: LLaMA-Factory tutorial provided
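What "swappable" means in practice is easiest to see in a toy version of a four-stage pipeline. The sketch below borrows only the stage order from the highlights above; the `Pipeline` class, its `replace` method, and every stage function are hypothetical and are not the glmocr SDK's API.

```python
from typing import Any, Callable, List, Tuple

Stage = Callable[[Any], Any]


class Pipeline:
    """Runs named stages in order; any stage can be swapped by name."""

    def __init__(self, stages: List[Tuple[str, Stage]]):
        self.stages = list(stages)

    def replace(self, name: str, stage: Stage) -> "Pipeline":
        if name not in [n for n, _ in self.stages]:
            raise KeyError(f"no stage named {name!r}")
        # Return a new pipeline so the original stays usable.
        return Pipeline([(n, stage if n == name else s) for n, s in self.stages])

    def run(self, data: Any) -> Any:
        for _, stage in self.stages:
            data = stage(data)
        return data


# Stand-ins for loading, layout detection, recognition, and formatting.
def load_pages(path):
    return [{"source": path, "page": 1}]


def detect_regions(pages):
    return [dict(p, regions=["text"]) for p in pages]


def read_regions(pages):
    return [
        dict(p, text=[f"{r} on page {p['page']}{p.get('note', '')}" for r in p["regions"]])
        for p in pages
    ]


def format_result(pages):
    return "\n".join(t for p in pages for t in p["text"])


def load_deskewed(path):
    """Custom loader: same output shape, plus a (pretend) preprocessing step."""
    return [dict(p, note=" (deskewed)") for p in load_pages(path)]


default = Pipeline([
    ("load", load_pages),
    ("layout", detect_regions),
    ("ocr", read_regions),
    ("format", format_result),
])
custom = default.replace("load", load_deskewed)
```

Here `custom.run("doc.pdf")` goes through the same layout, OCR, and formatting stages as `default`, with only the loader changed. That works because every stage agrees on the data shape it hands to the next one.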
Caveats
- The “state-of-the-art” claims are from the authors’ own technical report; no independent verification cited
- License soup: code is Apache 2.0, model weights are MIT, and the PP-DocLayoutV3 layout model the pipeline depends on is Apache 2.0, so there are two licenses in play and compliance is on you
- Self-hosted setup requires juggling layout model placement (CPU vs. GPU) and context-length tuning for large PDFs
Verdict
Worth a look if you’re building document pipelines and want something lighter than throwing GPT-4V at every page. Skip it if you need battle-tested, vendor-neutral OCR with audited benchmarks; the ecosystem here is young and Zhipu-hosted.