Is RedPajama-Data open source?

Yes — togethercomputer/RedPajama-Data is open source, released under the Apache-2.0 license.

What language is RedPajama-Data written in?

togethercomputer/RedPajama-Data is primarily written in Python.

How popular is RedPajama-Data?

togethercomputer/RedPajama-Data has 5k stars on GitHub.

Where can I find RedPajama-Data?

togethercomputer/RedPajama-Data is on GitHub at https://github.com/togethercomputer/RedPajama-Data.

togethercomputer/RedPajama-Data

Turning 84 CommonCrawl Dumps into LLM-Ready Training Data

This is the heavy machinery that scrubs, scores, and deduplicates CommonCrawl into the 30-trillion-token RedPajama-V2 dataset.

★5k stars Python Data Tooling Language Models

What it does

RedPajama-Data is the preprocessing pipeline behind the RedPajama-V2 dataset. It ingests over 100 billion raw documents from 84 CommonCrawl snapshots, runs them through the CCNet pipeline, and produces a filtered corpus with quality annotations. The resulting head_middle subset alone contains 20.8 billion deduplicated documents across five European languages, estimated at 30.4 trillion tokens.

The interesting bit

The pipeline treats data cleaning as a three-stage factory line: artifact preparation, quality-signal computation, and deduplication. The quality stage emits a broad set of annotations (perplexity scores, language identification, length metrics, and more) while also generating the MinHash signatures that the final stage consumes. Fuzzy deduplication then applies a Polars-based locality-sensitive hashing implementation; the authors note it was tested on 200 million documents on a 64-core machine with 500 GB of RAM, so plan for substantial hardware.
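To make the MinHash and LSH idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not the repository's code (which is Polars-based); the function names, the word-trigram shingling, and the NUM_PERM and BANDS values are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 32  # signature length (hypothetical choice)
BANDS = 8      # LSH bands; each band covers NUM_PERM // BANDS signature rows


def shingles(text, n=3):
    """Return the set of word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _seeded_hash(seed, token):
    """64-bit hash of `token`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text)
    return tuple(min(_seeded_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def lsh_candidate_pairs(signatures):
    """Given {doc_id: signature}, return pairs of ids that share at least one band."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Documents whose signatures agree on every row of some band land in the same bucket and become candidate duplicates. More bands with fewer rows each catch looser matches at the cost of more false candidates.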

Key highlights

Covers five languages: English, German, French, Spanish, and Italian
Produces 30B documents annotated with quality signals and 20B deduplicated documents
Exact deduplication uses a memory-mapped Bloom filter (pybloomfiltermmap3) with configurable capacity and error rates
Fuzzy deduplication relies on LSH over MinHash signatures generated during the quality-signal step
Built around Docker/Apptainer containers and expects S3-backed data for the exact-dedup stage
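
The capacity and error-rate knobs mentioned for exact deduplication can be illustrated with a small in-memory Bloom filter. This is a stand-in written for illustration under stated assumptions, not the pybloomfiltermmap3 API and not the repository's implementation; the class and function names are hypothetical, and the sizing uses the standard textbook formulas.

```python
import hashlib
import math


class BloomFilter:
    """Toy in-memory Bloom filter sized from a target capacity and error rate."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert `item`; return True if it was (probably) already present."""
        already_present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                already_present = False
                self.bits[byte] |= 1 << bit
        return already_present


def dedup_exact(docs, capacity=1000, error_rate=0.01):
    """Keep the first occurrence of each document, in order."""
    seen = BloomFilter(capacity, error_rate)
    return [doc for doc in docs if not seen.add(doc)]
```

A lower error rate or a higher capacity grows the bit array. The trade-off is memory against the chance that a unique document is wrongly dropped as a false positive.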

Caveats

The running scripts assume Docker and Apptainer are available, though the steps can technically run without containers
Exact deduplication via Bloom filter requires the source data to live in an S3 bucket
The README warns that PYTHONHASHSEED must be pinned to a fixed value when running outside the provided container scripts; otherwise the hash functions used in DSIR weight computation will give inconsistent results from one run to the next
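
The PYTHONHASHSEED caveat follows from how Python salts its built-in string hash per process. The sketch below shows the effect by starting fresh interpreters with a chosen seed; the helper name and the sample string are hypothetical, and it says nothing about how the repository itself computes DSIR weights.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_process(text, seed):
    """Return hash(text) as computed by a new interpreter run with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Same seed: two separate processes agree on the hash value.
    print(string_hash_in_fresh_process("redpajama", 42))
    print(string_hash_in_fresh_process("redpajama", 42))
    # Different seed: the value changes, so hash-derived features would not line up.
    print(string_hash_in_fresh_process("redpajama", 43))
```

Any step that buckets text with the built-in hash will therefore only reproduce across machines and runs when every process is started with the same seed.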

Verdict

Worth a look if you are assembling a data-processing stack for foundation-model training and need a reference implementation for web-corpus cleaning at scale. Skip it if you just want the finished dataset; the processed data is already on HuggingFace, and this repository is strictly the plumbing.
