GAIR-NLP/MathPile
A 9.5B-token math-centric corpus for pre-training LLMs on mathematical reasoning.

Velocity (7d): +0.5 ★/day · Trend: steady
MathPile is a large-scale corpus for pre-training language models on mathematical content. Published in the NeurIPS 2024 Datasets and Benchmarks (D&B) track, it contains approximately 9.5 billion tokens of high-quality, diverse math-centric data drawn from multiple sources, including web math, textbooks, and arXiv papers. The repository includes the data processing scripts used to construct and clean the corpus, so researchers can reproduce the dataset or adapt the pipeline to other domains.
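
To give a feel for what a corpus-cleaning step of this kind can look like, here is a minimal sketch in Python. It is not taken from the MathPile repository: the function names `normalize` and `clean_corpus`, the `min_chars` threshold, and the choice of exact hash-based deduplication are all assumptions made for illustration, and the actual scripts may work quite differently.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    # Collapse runs of whitespace so trivially different copies compare equal.
    return " ".join(text.split())


def clean_corpus(docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Hypothetical cleaning pass: drop very short documents and exact duplicates.

    Illustrative only; not the MathPile pipeline.
    """
    seen = set()
    for doc in docs:
        text = normalize(doc)
        if len(text) < min_chars:
            continue  # too short to be a useful training document
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after whitespace normalization
        seen.add(digest)
        yield text


sample = [
    "Let  x be a real number.\nThen x^2 >= 0.",
    "Let x be a real number. Then x^2 >= 0.",
    "too short",
    "A group is a set with an associative binary operation, an identity, and inverses.",
]
cleaned = list(clean_corpus(sample))
```

In this sketch the second sample is dropped as a duplicate of the first once whitespace is normalized, and the third is dropped for length. A real pipeline would typically add further stages (for example language identification or near-duplicate detection), which are out of scope here.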