
NVIDIA-NeMo/Curator

A GPU-accelerated toolkit for preprocessing and curating training data for large language models.

1.6k stars · Python · Data Tooling
Velocity (7d): +2.0 ★ / day · Trend: steady

NVIDIA NeMo Curator is a scalable data preprocessing and curation toolkit for LLM training. It provides modular pipelines for processing text, image, video, and audio data at scale, from a single laptop to multi-node clusters. Its capabilities include deduplication (including semantic deduplication), classification, and quality filtering, all built on Ray for distributed execution.
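
To make the idea of a modular curation pipeline concrete, here is a minimal plain-Python sketch that chains a quality filter and an exact-deduplication stage. It does not use Curator's API or Ray; the names `quality_filter`, `exact_dedup`, and `run_pipeline`, and the `min_words` threshold, are hypothetical and chosen only for illustration.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# A stage takes an iterable of documents and yields the ones it keeps.
Stage = Callable[[Iterable[str]], Iterator[str]]


def quality_filter(docs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Keep documents with at least `min_words` whitespace-separated words.

    The threshold is an illustrative assumption, not a Curator default.
    """
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Drop documents whose normalized text has already been seen.

    Normalization here is lowercasing plus whitespace collapsing; the first
    occurrence of each distinct text is kept.
    """
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def run_pipeline(docs: Iterable[str], stages: List[Stage]) -> List[str]:
    """Apply each stage in order and materialize the surviving documents."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "too short",                                      # fails the quality filter
    "A second, sufficiently long document about data curation.",
]
result = run_pipeline(docs, [quality_filter, exact_dedup])
```

Each stage is an independent generator, so stages can be reordered or swapped without touching the others. A distributed toolkit would run comparable stages across many workers rather than in a single process.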
