NVIDIA-NeMo/Curator
A GPU-accelerated toolkit for preprocessing and curating training data for large language models.

Velocity (7d): +2.0 ★/day
Trend: steady
NVIDIA NeMo Curator is a scalable data preprocessing and curation toolkit for LLM training. It provides modular pipelines for processing text, image, video, and audio data at scale, from a single laptop to a multi-node cluster. The toolkit covers deduplication (including semantic deduplication), classification, and quality filtering, all built on Ray for distributed execution.
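
To make the idea of a modular curation pipeline concrete, here is a minimal sketch in plain Python using only the standard library. It does not use NeMo Curator's API or Ray; the names `quality_filter`, `exact_dedup`, and `run_pipeline`, and the word-count threshold, are hypothetical and chosen only for illustration. It shows two of the stage types mentioned above (quality filtering and exact deduplication) chained over a small in-memory corpus.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# A stage takes a stream of documents and yields a (possibly smaller) stream.
# This is an illustrative type, not a NeMo Curator interface.
Stage = Callable[[Iterable[str]], Iterator[str]]


def quality_filter(min_words: int = 5) -> Stage:
    """Hypothetical quality heuristic: keep documents with at least min_words words."""
    def stage(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc.split()) >= min_words:
                yield doc
    return stage


def exact_dedup() -> Stage:
    """Drop documents whose normalized text has already been seen."""
    def stage(docs: Iterable[str]) -> Iterator[str]:
        seen = set()
        for doc in docs:
            # Normalize case and whitespace so trivial variants hash identically.
            normalized = " ".join(doc.lower().split())
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc
    return stage


def run_pipeline(docs: Iterable[str], stages: List[Stage]) -> List[str]:
    """Apply each stage in order and collect the surviving documents."""
    stream: Iterable[str] = docs
    for stage in stages:
        stream = stage(stream)
    return list(stream)


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Too short.",                                      # fails the word-count filter
    "A second, distinct document with enough words to keep.",
]

curated = run_pipeline(corpus, [quality_filter(min_words=5), exact_dedup()])
print(curated)
```

Because each stage is an independent function over a stream of documents, stages can be reordered or swapped without touching the others. A distributed toolkit would additionally partition the data across workers, which this single-process sketch does not attempt.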