Is olmocr open source?

Yes — allenai/olmocr is open source, released under the Apache-2.0 license.

What language is olmocr written in?

allenai/olmocr is primarily written in Python.

How popular is olmocr?

allenai/olmocr has 19.2k stars on GitHub and is currently cooling off.

Where can I find olmocr?

allenai/olmocr is on GitHub at https://github.com/allenai/olmocr.

← all repositories

allenai/olmocr

Turning PDF chaos into LLM training fuel with a 7B VLM

olmOCR exists because LLMs cannot train on PDFs until someone strips the formatting chaos and restores natural reading order.

★19.2k stars Python Data Tooling

View on GitHub ↗

Velocity · 7d

+12

★ / day

Trend

↘cooling

star history

What it does

olmOCR takes PDFs, PNGs, and JPEGs—complete with equations, tables, handwriting, and multi-column layouts—and converts them into clean Markdown. It strips headers and footers automatically and attempts to impose a natural reading order even when figures or insets get in the way. The output is meant for machine consumption first: it is built specifically to feed LLM datasets and pre-training pipelines, not just to make a document human-readable.

The interesting bit

The project treats document conversion as an infrastructure problem, not a side hobby. AI2 ships a rigorous benchmark suite—olmOCR-Bench—with over 7,000 test cases across 1,400 documents to keep score, and the latest v0.4.0 model was trained with synthetic data and reinforcement learning, techniques usually reserved for chat models rather than document parsers. The efficiency claim is almost as notable as the accuracy one: the maintainers say it costs less than $200 to process a million pages.

Key highlights

Handles complex layouts—multi-column, insets, figures, equations, and tables—without manual template tuning.
Competitive benchmark scores: v0.4.0 scores 82.4 overall on olmOCR-Bench, trading blows with larger or specialized rivals.
Flexible deployment: runs fully offline on a recent NVIDIA GPU with 12GB+ of VRAM, or works as a lightweight client against a remote vLLM/OpenAI-compatible server.
Ships with its own evaluation suite and training code, so you can fine-tune the 7B model rather than treating it as a black box.
Automatically removes headers and footers as part of the standard pipeline.

Caveats

Local inference demands real hardware: a recent NVIDIA GPU with at least 12GB of VRAM and 30GB of free disk space.
The benchmark table shows it trails some competitors like Chandra OCR (83.1) and Infinity-Parser (82.5) overall, so it is not the undisputed accuracy leader.
It is not a pure-Python library; you will need system dependencies such as poppler-utils and additional font packages.

Verdict

Data engineers building LLM pre-training corpora should take a close look; if you only need to OCR the occasional single-page scan, the GPU requirement and pipeline overhead are likely overkill.

Frequently asked

What is allenai/olmocr?: olmOCR exists because LLMs cannot train on PDFs until someone strips the formatting chaos and restores natural reading order.
Is olmocr open source?: Yes — allenai/olmocr is open source, released under the Apache-2.0 license.
What language is olmocr written in?: allenai/olmocr is primarily written in Python.
How popular is olmocr?: allenai/olmocr has 19.2k stars on GitHub and is currently cooling off.
Where can I find olmocr?: allenai/olmocr is on GitHub at https://github.com/allenai/olmocr.