
togethercomputer/RedPajama-Data

RedPajama-Data provides the data processing pipeline for the RedPajama-V2 open dataset containing 30 trillion tokens for LLM training.

4.9k stars · Python · Data Tooling · Language Models
Velocity (7d): +4.3 ★ / day · Trend: steady

This repository contains the code for preparing RedPajama-V2, an open 30-trillion-token corpus for training large language models. It runs the CCNet pipeline over 84 CommonCrawl snapshots, covering more than 100 billion text documents, and applies quality signals and deduplication across multiple languages. The processing steps are configurable and can be deployed with Docker.
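To make "quality signals and deduplication" concrete, here is a minimal Python sketch of the general idea. It is not taken from the repository: the function names (`quality_signals`, `content_hash`, `dedup_exact`) and the specific signals are hypothetical, and the deduplication shown is plain exact matching on a hash of normalized text, which may differ from what the actual pipeline does.

```python
import hashlib
from typing import Iterable, Iterator


def quality_signals(text: str) -> dict:
    """Compute a few toy per-document signals (hypothetical, for illustration)."""
    words = text.split()
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        "word_count": len(words),
        "line_count": len(lines),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def content_hash(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document whose normalized hash has not been seen before."""
    seen = set()
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            yield doc
```

A caller could, for example, stream documents through `dedup_exact` and then keep only those whose `quality_signals(doc)["word_count"]` exceeds a threshold of their choosing. Storing signals alongside documents rather than filtering up front leaves that threshold decision to the downstream user.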
