grobidOrg/grobid

Turning PDF chaos into structured science since 2008

A battle-tested Java toolkit that extracts metadata, references, and full text from academic PDFs using a cascade of ML models.

★ 5k stars · Java · Domain Apps · Data Tooling

What it does

GROBID is a Java-based document parsing pipeline that takes raw scientific PDFs and outputs structured XML/TEI. It extracts headers, bibliographic references, citation contexts, full-text body structures, funding information, and even copyright licenses, all with bounding-box coordinates for interactive overlays.
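To make the XML/TEI output more concrete, here is a minimal sketch of reading a title, a reference list, and bounding boxes with Python's standard library. The sample document is hand-written and far smaller than real output. The TEI namespace is standard, but the element paths and the `coords` layout (`page,x,y,width,height`, with multiple boxes separated by `;`) are assumptions about the output shape and should be checked against actual GROBID results. `summarize` and `parse_coords` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; GROBID's TEI output is assumed to use it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written stand-in for GROBID output (not produced by GROBID).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0" coords="3,56.7,120.4,240.0,9.5">
            <analytic><title level="a">First cited work</title></analytic>
          </biblStruct>
          <biblStruct xml:id="b1">
            <analytic><title level="a">Second cited work</title></analytic>
          </biblStruct>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def parse_coords(value):
    """Split a coords string into (page, x, y, width, height) tuples.

    Assumes boxes are separated by ';' and fields by ','.
    """
    boxes = []
    for chunk in value.split(";"):
        page, x, y, w, h = chunk.split(",")
        boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def summarize(tei_xml):
    """Return the main title and a list of references with their boxes."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    references = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        coords = bibl.get("coords")
        references.append({
            "title": bibl.findtext(".//tei:title", namespaces=TEI_NS),
            "boxes": parse_coords(coords) if coords else [],
        })
    return {"title": title, "references": references}


if __name__ == "__main__":
    print(summarize(SAMPLE_TEI))
```

The bounding boxes are what an interactive overlay would draw on top of the rendered PDF page.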

The interesting bit

The architecture is a cascade of sequence-labeling models (CRF by default, optional RNN or transformer-based deep learning via JEP) that jointly consume text and visual layout features from pdfalto. The project explicitly warns that its default CRF configuration is the easy path, while deep learning models “perform significantly better”, particularly for reference parsing, if you have the GPU to feed them.
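The cascade idea can be illustrated with a toy sketch: a first stage assigns coarse zones, and each zone is then handed to a specialized second stage. Everything here is hypothetical and is not GROBID code. The stages are hard-coded rules standing in for trained sequence labelers, and the function names (`segment`, `parse_header`, `parse_reference`, `run_cascade`) are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[str, str]]  # (zone label, line text) pairs


def segment(lines: List[str]) -> Labeled:
    """Stage 1 stand-in: assign each line a coarse zone.

    A real system would use a trained model over text and layout
    features; this toy version uses position and a keyword.
    """
    labeled: Labeled = []
    in_references = False
    for index, line in enumerate(lines):
        if line.strip().lower() == "references":
            in_references = True  # section heading itself is dropped
            continue
        if in_references:
            labeled.append(("references", line))
        elif index == 0:
            labeled.append(("header", line))
        else:
            labeled.append(("body", line))
    return labeled


def parse_header(text: str) -> dict:
    """Stage 2 stand-in for a header model."""
    return {"title": text.strip()}


def parse_reference(text: str) -> dict:
    """Stage 2 stand-in for a citation model: split at the first '. '."""
    authors, _, rest = text.partition(". ")
    return {"authors": authors, "rest": rest}


# Each zone label routes to its own downstream parser.
STAGE_TWO: Dict[str, Callable[[str], dict]] = {
    "header": parse_header,
    "references": parse_reference,
}


def run_cascade(lines: List[str]) -> dict:
    """Run stage 1, then dispatch every zone to its stage 2 parser."""
    output: dict = {"header": [], "body": [], "references": []}
    for label, text in segment(lines):
        handler = STAGE_TWO.get(label)
        output[label].append(handler(text) if handler else text)
    return output


if __name__ == "__main__":
    document = [
        "A Sample Paper",
        "Some body text.",
        "References",
        "Doe J. A cited work. 2020.",
    ]
    print(run_cascade(document))
```

The point of the structure is that each downstream model only sees the text its upstream stage routed to it, so errors in the first stage propagate to the later ones.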

Key highlights

Production deployments include ResearchGate, Semantic Scholar, HAL, scite.ai, Academia.edu, CERN, and Internet Archive Scholar
Reference parsing hits ~0.90 F1 on bioRxiv with deep learning models; DOI/PMID resolution exceeds 0.95 F1 when consolidated against CrossRef
Full-text throughput benchmarked at ~10.6 PDFs/second (about 915K per day) via the multi-threaded Node.js client on a 16-core machine
68 distinct labels for fine-grained structure, from author middle names to figure captions
Docker images, web service API, batch processing, and clients in Python/Java/Node.js/Go
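As a rough illustration of the web service API mentioned above, the following sketch builds a multipart upload for a locally running server using only the Python standard library. It assumes the service listens on `http://localhost:8070` and exposes `/api/processFulltextDocument` with a file field named `input` and a `consolidateCitations` form field. Those details should be verified against the GROBID service documentation for the version in use. `build_fulltext_request` and `fetch_tei` are hypothetical helper names, and the official Python client is likely the better choice for real workloads.

```python
import urllib.request
import uuid


def build_fulltext_request(pdf_bytes, base_url="http://localhost:8070",
                           consolidate_citations="0"):
    """Build (but do not send) a multipart POST carrying one PDF.

    Endpoint path and field names are assumptions; see the lead-in.
    """
    boundary = uuid.uuid4().hex
    # Plain form field controlling citation consolidation.
    field = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="consolidateCitations"\r\n'
        "\r\n"
        f"{consolidate_citations}\r\n"
    ).encode("ascii")
    # File part header; the raw PDF bytes follow it.
    file_head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="input"; filename="paper.pdf"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("ascii")
    closing = f"\r\n--{boundary}--\r\n".encode("ascii")
    body = field + file_head + pdf_bytes + closing
    return urllib.request.Request(
        url=f"{base_url}/api/processFulltextDocument",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": "application/xml",
        },
    )


def fetch_tei(pdf_path, **kwargs):
    """Send one PDF to the service and return the response body as text."""
    with open(pdf_path, "rb") as handle:
        request = build_fulltext_request(handle.read(), **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

With a server running, `fetch_tei("paper.pdf")` would return the TEI document as a string. Building the request separately from sending it keeps the upload format easy to inspect without a live service.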

Caveats

Windows support is currently not guaranteed (“help welcome!”)
Deep learning models are disabled by default; enabling them requires Python 3.10–3.11, JEP, and ideally NVIDIA CUDA
Demo servers have quota limits and run CPU-only, so they’re genuinely just for testing

Verdict

Essential infrastructure if you’re building scholarly search, bibliometric analysis, or automated literature review pipelines at scale. Skip it if you only need occasional PDF text extraction: this is specialized tooling with specialized setup costs.

Frequently asked

What is grobidOrg/grobid? A battle-tested Java toolkit that extracts metadata, references, and full text from academic PDFs using a cascade of ML models.
Is grobid open source? Yes. grobidOrg/grobid is open source, released under the Apache-2.0 license.
What language is grobid written in? grobidOrg/grobid is primarily written in Java.
How popular is grobid? grobidOrg/grobid has 5k stars on GitHub.
Where can I find grobid? grobidOrg/grobid is on GitHub at https://github.com/grobidOrg/grobid.