togethercomputer/RedPajama-Data
RedPajama-Data provides the data processing pipeline for the RedPajama-V2 open dataset containing 30 trillion tokens for LLM training.

Velocity (7d): +4.3 ★/day
Trend: steady
This repository contains the code for preparing the RedPajama-V2 dataset, an open 30-trillion-token corpus for training large language models. It processes over 100B text documents from 84 CommonCrawl snapshots with CCNet, then computes quality signals and deduplicates the documents across multiple languages. The processing pipeline is configurable and can be deployed with Docker.
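
To make the two post-processing ideas named above concrete, here is a minimal Python sketch of per-document quality signals and exact deduplication. It is an illustration only, not the repository's code: the function names `quality_signals` and `dedup_exact`, the three signals, and the use of a SHA-1 hash over whitespace- and case-normalized text are all assumptions made for this example.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def quality_signals(text: str) -> Dict[str, float]:
    """Compute a few simple, hypothetical quality signals for one document."""
    words = text.split()
    n = len(words)
    return {
        "word_count": n,
        # Average number of characters per whitespace-separated word.
        "mean_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Share of distinct (case-folded) words; low values suggest repetition.
        "unique_word_fraction": len({w.lower() for w in words}) / n if n else 0.0,
    }


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document once, dropping exact duplicates after normalization."""
    seen = set()
    for doc in docs:
        # Normalization here is an assumption: lowercase and collapse whitespace.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


if __name__ == "__main__":
    corpus = ["Hello world", "hello   WORLD", "Another doc"]
    for doc in dedup_exact(corpus):
        print(doc, quality_signals(doc))
```

Storing a fixed-size digest per document rather than the text itself keeps the memory needed for the seen-set small, although at the scale described above a single in-memory set would not suffice and a sharded or probabilistic structure would be needed instead.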