Is dolma open source?

Yes — allenai/dolma is open source, released under the Apache-2.0 license.

What language is dolma written in?

allenai/dolma is primarily written in Python.

How popular is dolma?

allenai/dolma has 1.5k stars on GitHub.

Where can I find dolma?

allenai/dolma is on GitHub at https://github.com/allenai/dolma.

allenai/dolma

The open data refinery that fed OLMo

Dolma is the open dataset and high-performance curation engine that AI2 built to pre-train the OLMo language model.

★ 1.5k stars · Python · Data Tooling · Language Models

What it does

Dolma is two things: a 3-trillion-token open corpus drawn from web pages, academic papers, code, books, and encyclopedias, and the toolkit used to curate it. The toolkit parallelizes processing across billions of documents, whether on a single machine, a cluster, or cloud infrastructure. It ships with built-in taggers modeled on the filters used for Gopher, C4, and OpenWebText, plus a Rust-backed Bloom filter for deduplication.
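To make the idea of a tagger concrete, here is a minimal sketch of a Gopher-style quality heuristic in plain Python. It is not Dolma's API: the function name `gopher_style_tags`, its parameters, and the returned keys are hypothetical, and the thresholds (word count, mean word length, share of words containing a letter) only roughly follow the kind of rules reported for Gopher, so treat them as illustrative assumptions.

```python
def gopher_style_tags(
    text: str,
    min_words: int = 50,          # assumed lower bound on document length
    max_words: int = 100_000,     # assumed upper bound on document length
    min_mean_len: float = 3.0,    # assumed bounds on mean word length
    max_mean_len: float = 10.0,
    min_alpha_fraction: float = 0.8,  # assumed share of words with a letter
) -> dict:
    """Compute simple quality attributes for one document (hypothetical helper)."""
    words = text.split()
    word_count = len(words)

    if word_count == 0:
        mean_len = 0.0
        alpha_fraction = 0.0
    else:
        mean_len = sum(len(w) for w in words) / word_count
        # Fraction of words that contain at least one alphabetic character.
        alpha_fraction = sum(any(c.isalpha() for c in w) for w in words) / word_count

    passes = (
        min_words <= word_count <= max_words
        and min_mean_len <= mean_len <= max_mean_len
        and alpha_fraction >= min_alpha_fraction
    )
    # The tagger only annotates; a later step decides what to keep or drop.
    return {
        "word_count": word_count,
        "mean_word_length": mean_len,
        "alpha_word_fraction": alpha_fraction,
        "passes": passes,
    }


if __name__ == "__main__":
    for doc in ["too short", "plain words " * 40, "1234 5678 " * 40]:
        print(gopher_style_tags(doc))
```

Separating annotation from filtering, as this sketch does, means the same attributes can be reused with different keep-or-drop rules without recomputing them.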

The interesting bit

Most training corpora are released as static dumps; Dolma pairs its dataset with the exact same open-source pipeline used to produce it, so you can rerun, tweak, or extend the cleaning logic instead of blindly trusting someone else’s work. That transparency is rarer than it should be in LLM land.

Key highlights

3 trillion tokens under the ODC-BY license, hosted on the Hugging Face Hub
Built-in parallelism for processing billions of documents concurrently
Ready-made taggers for Gopher, C4, and OpenWebText-style curation
Fast document deduplication via a Rust Bloom filter
Extensible architecture with support for custom taggers and S3-compatible storage
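
As an illustration of the Bloom-filter deduplication mentioned in the highlights, the following is a minimal pure-Python sketch. It is not Dolma's Rust implementation: the `BloomFilter` class, the `deduplicate` helper, the SHA-256 double-hashing scheme, and the default sizes are all assumptions chosen for readability. It deduplicates on exact document text only.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: a bit array probed by k derived hash positions."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        bits = -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is nonzero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item and return True if it was probably already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def deduplicate(documents, expected_items: int = 1000):
    """Keep the first occurrence of each text; later exact repeats are dropped."""
    bloom = BloomFilter(expected_items)
    return [text for text in documents if not bloom.add(text)]


if __name__ == "__main__":
    print(deduplicate(["alpha", "beta", "alpha", "gamma", "beta"]))
```

The trade-off is the usual one for Bloom filters: memory stays fixed no matter how many documents pass through, at the cost of a small chance that a new document is mistaken for a duplicate. A filter never reports a previously added item as unseen.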

Verdict

Data engineers and researchers building open foundation models should look here; if you are just fine-tuning on a few thousand labeled examples, this is overkill.
