
allenai/dolma

A 3-trillion-token open dataset and high-performance toolkit for curating language model pre-training data.

1.5k stars · Python · Data Tooling · Language Models
Velocity · 7d: +1.4 ★ / day
Trend: steady

Dolma provides an open dataset of 3 trillion tokens sourced from web content, academic publications, code, books, and encyclopedic materials. The accompanying toolkit enables high-performance dataset curation for language model training with built-in parallelism and ready-to-use taggers for common filtering strategies like Gopher, C4, and OpenWebText.
