data-prep-kit/data-prep-kit

A scalable data preparation pipeline for preprocessing unstructured data to support LLM training, fine-tuning, and RAG applications.

937 stars · HTML · Data Tooling · RAG · Search
Star velocity (7d): +1.2 ★ / day · Trend: steady

Data Prep Kit provides scalable transforms and recipes for processing unstructured data in large language model workflows. It supports multiple execution backends, including Ray and Spark, and covers data curation tasks such as deduplication, text cleansing, and enrichment. The toolkit is designed for preparing data for pre-training, fine-tuning, and instruct-tuning, and for building RAG applications.
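
To make the curation tasks named above concrete, here is a minimal sketch of text cleansing followed by exact deduplication in plain Python. It does not use Data Prep Kit's API; the function names `clean_text` and `dedupe` are hypothetical and chosen only for illustration, and a real pipeline would run equivalent logic as a transform on a backend such as Ray or Spark.

```python
import hashlib
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (a stand-in for text cleansing)."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe(docs):
    """Keep the first occurrence of each document, compared by a hash of its cleaned text."""
    seen = set()
    unique = []
    for doc in docs:
        cleaned = clean_text(doc)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    docs = ["Hello   world", "Hello world ", "Another document"]
    print(dedupe(docs))  # ['Hello world', 'Another document']
```

Hashing the cleaned text rather than the raw text means documents that differ only in whitespace count as duplicates. This sketch only catches exact matches after cleaning; near-duplicate detection would need a different technique.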
