data-prep-kit/data-prep-kit
A scalable data preparation pipeline for preprocessing unstructured data to support LLM training, fine-tuning, and RAG applications.

Velocity (7d): +1.2 ★/day · Trend: steady
Data Prep Kit provides scalable transforms and recipes for processing unstructured data in large language model workflows. It supports multiple execution backends, including Ray and Spark, and covers data curation tasks such as deduplication, text cleansing, and enrichment. The toolkit targets data preparation for LLM pre-training, fine-tuning, instruct-tuning, and RAG applications.
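To make the curation tasks named above concrete, here is a minimal sketch in plain Python of text cleansing, exact deduplication, and a simple enrichment step. It does not use Data Prep Kit's API; the function names `clean_text`, `dedupe_exact`, and `enrich` are hypothetical and chosen only for illustration, and a real pipeline would run equivalent logic on a distributed backend.

```python
import hashlib
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document.

    Documents are compared by the SHA-256 digest of their cleaned text,
    so this is exact deduplication only (not fuzzy or near-duplicate).
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        cleaned = clean_text(doc)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


def enrich(doc: str) -> dict:
    """Attach a simple metadata field (word count) to a document."""
    return {"text": doc, "num_words": len(doc.split())}


if __name__ == "__main__":
    raw = ["Hello   world", "Hello world ", "Another document"]
    for record in map(enrich, dedupe_exact(raw)):
        print(record)
```

Hashing the cleaned text rather than the raw text means documents that differ only in whitespace are treated as duplicates, which is one common design choice; stricter or looser matching would need a different key.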