allenai/dolma
A 3-trillion-token open dataset and high-performance toolkit for curating language model pre-training data.

Velocity (7d): +1.4 ★/day · Trend: steady
Dolma provides an open dataset of 3 trillion tokens sourced from web content, academic publications, code, books, and encyclopedic materials. The accompanying toolkit enables high-performance dataset curation for language model training with built-in parallelism and ready-to-use taggers for common filtering strategies like Gopher, C4, and OpenWebText.
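
To make the idea of a rule-based quality filter concrete, here is a minimal Python sketch of a Gopher-style document check. It does not use the Dolma toolkit's API and is not its tagger implementation. The names `gopher_stats` and `passes_gopher_like` are hypothetical, and the thresholds (50 to 100,000 words, mean word length between 3 and 10, at least 80% of words containing a letter, at least two common stop words) are assumptions loosely based on the heuristics described for Gopher, covering only a subset of them.

```python
from dataclasses import dataclass

# Assumed stop-word list for illustration; a real tagger may use a different set.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


@dataclass
class GopherStats:
    word_count: int
    mean_word_length: float
    alpha_word_fraction: float  # share of words containing at least one letter
    stop_word_hits: int         # number of distinct stop words present


def gopher_stats(text: str) -> GopherStats:
    """Compute simple whitespace-tokenized statistics for one document."""
    words = text.split()
    n = len(words)
    if n == 0:
        return GopherStats(0, 0.0, 0.0, 0)
    mean_len = sum(len(w) for w in words) / n
    alpha_frac = sum(any(c.isalpha() for c in w) for w in words) / n
    normalized = {w.lower().strip(".,;:!?\"'()") for w in words}
    hits = len(STOP_WORDS & normalized)
    return GopherStats(n, mean_len, alpha_frac, hits)


def passes_gopher_like(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Return True if the document passes every (assumed) threshold."""
    s = gopher_stats(text)
    return (
        min_words <= s.word_count <= max_words
        and 3.0 <= s.mean_word_length <= 10.0
        and s.alpha_word_fraction >= 0.8
        and s.stop_word_hits >= 2
    )


if __name__ == "__main__":
    prose = "the quick brown fox jumps over the lazy dog and runs to the river with joy " * 10
    noise = "12345 ##### 67890 " * 60
    print(passes_gopher_like(prose), passes_gopher_like(noise))
```

In a curation pipeline, a check like this would typically emit a per-document label or score rather than drop text directly, so that filtering decisions can be tuned later without recomputing the statistics.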