
GAIR-NLP/MathPile

A 9.5B-token math-centric pre-training corpus for training LLMs on mathematical reasoning.

418 stars · Python · Data Tooling · Language Models
Velocity (7d): +0.5 ★ / day · Trend: steady

MathPile is a math-centric pre-training corpus of approximately 9.5 billion tokens, built for training language models on mathematical content. It was presented in the NeurIPS 2024 Datasets and Benchmarks (D&B) track. The data is drawn from multiple sources, including web math, textbooks, and arXiv papers. The repository includes the data processing scripts used to construct and clean the corpus, so researchers can reproduce the dataset or apply similar pipelines to other domains.
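To give a rough sense of what a corpus-cleaning step can look like, here is a minimal sketch of two common operations: dropping very short documents and removing exact duplicates after whitespace and case normalization. This is not taken from the MathPile scripts. The function names `normalize` and `clean_corpus` and the `min_words` threshold are hypothetical, chosen only for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs, min_words=5):
    """Illustrative cleaning pass (hypothetical, not MathPile's actual pipeline).

    Keeps the first occurrence of each document, skipping documents that have
    fewer than `min_words` whitespace-separated tokens after normalization.
    """
    seen = set()   # SHA-256 digests of normalized documents already kept
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(doc)  # keep the original, un-normalized text
    return kept
```

Real pipelines typically add further stages, such as language identification and near-duplicate detection. Consult the repository's own scripts for what MathPile actually does.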
