Is data-prep-kit open source?

Yes — data-prep-kit/data-prep-kit is open source, released under the Apache-2.0 license.

What language is data-prep-kit written in?

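GitHub's language detection reports HTML as the primary language of data-prep-kit/data-prep-kit; the transforms themselves run on Python-native and Ray runtimes.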

How popular is data-prep-kit?

data-prep-kit/data-prep-kit has 949 stars on GitHub.

Where can I find data-prep-kit?

data-prep-kit/data-prep-kit is on GitHub at https://github.com/data-prep-kit/data-prep-kit.

data-prep-kit/data-prep-kit

Unstructured data curation from laptop to data center

Modular transforms that cleanse and enrich unstructured text, code, and images for LLM training and RAG, scaling from a laptop to a Kubernetes cluster.

★ 949 stars · HTML · Data Tooling · RAG · Search

What it does

Data Prep Kit is a collection of pluggable transforms (deduplication, PII redaction, document chunking, malware detection, and image filtering among them) that prepare raw unstructured data for LLM pre-training, fine-tuning, or RAG applications. Hosted under the Linux Foundation, it offers both Python-native and Ray-distributed runtimes, so the same transform can run locally or scale out to a data center. End-to-end recipes and Tekton pipeline support are included for chaining transforms into production workflows.
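
The idea of pluggable transforms chained into a pipeline can be sketched in plain Python. This is an illustration only, not Data Prep Kit's API: the `Record` and `Transform` aliases, `redact_emails`, `drop_short`, and `run_pipeline` are hypothetical names, and the email pattern is a deliberately simple stand-in for real PII redaction.

```python
import re
from typing import Callable, Dict, List

# Hypothetical types: a record is a plain dict, and a transform maps a
# list of records to a new list of records.
Record = Dict[str, str]
Transform = Callable[[List[Record]], List[Record]]

# Simplistic email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(records: List[Record]) -> List[Record]:
    """Replace anything that looks like an email address with a placeholder."""
    return [{**r, "text": EMAIL_RE.sub("<EMAIL>", r["text"])} for r in records]


def drop_short(min_chars: int) -> Transform:
    """Build a transform that drops records whose text is under min_chars."""
    def _transform(records: List[Record]) -> List[Record]:
        return [r for r in records if len(r["text"]) >= min_chars]
    return _transform


def run_pipeline(records: List[Record], transforms: List[Transform]) -> List[Record]:
    """Apply each transform in order, feeding its output to the next one."""
    for transform in transforms:
        records = transform(records)
    return records


docs = [
    {"id": "a", "text": "Contact me at jane@example.com for the dataset."},
    {"id": "b", "text": "too short"},
]
cleaned = run_pipeline(docs, [redact_emails, drop_short(20)])
print(cleaned)
```

Because every transform shares one signature, the same list of transforms could in principle be handed to a local loop, as here, or to a distributed runner; how Data Prep Kit wires this up is not shown on this page.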

The interesting bit

Rather than treating text, code, and images as afterthoughts, the kit makes all three first-class modalities with dedicated transforms and shared infrastructure for Parquet, ZIP, NDJSON, and JSONL. That universality is rarer than it sounds: most data-cleaning tools pick one domain and stay there.

Key highlights

Ingestion support spans HTML, PDF via Docling, code archives, and web crawls, with output normalized to Parquet.
Filtering and enrichment include exact and fuzzy deduplication, blocklists, quality scoring, language identification, tokenization, and HAP (hate/abuse/profanity) detection.
Code-specific transforms cover programming-language annotation, header cleansing, license selection, and semantic file ordering; image transforms include face and NSFW detection.
The framework allows authoring custom transforms, and pre-built recipes demonstrate fine-tuning and RAG pipelines.
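
To make the difference between exact and fuzzy deduplication concrete, here is a minimal standard-library sketch. It is not the kit's implementation: the function names, the 3-word shingles, and the 0.7 threshold are assumptions chosen for illustration, and the pairwise Jaccard comparison is a naive stand-in for the hashing-based methods that large-scale fuzzy deduplication typically relies on.

```python
import hashlib
from typing import List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_dedup(texts: List[str]) -> List[str]:
    """Keep the first occurrence of each document, keyed by a SHA-256 digest."""
    seen: Set[str] = set()
    kept: List[str] = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-word windows in the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedup(texts: List[str], threshold: float = 0.7, k: int = 3) -> List[str]:
    """Drop a document if it is too similar to any document already kept.

    Naive O(n^2) comparison, fine for a demo but not for large corpora.
    """
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for text in texts:
        current = shingles(text, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


texts = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # same after normalizing
    "The quick brown fox jumps over the lazy cat",   # near-duplicate
    "An unrelated sentence about data curation",
]
unique = fuzzy_dedup(exact_dedup(texts))
print(unique)
```

Exact deduplication removes the second sentence because it hashes identically after normalization. The third sentence survives that pass and is only caught by the similarity check, where it shares 6 of 8 distinct shingles with the first.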

Verdict

Worth a look if you need a governed, multi-modal data curation layer that can grow from a Colab notebook to a Kubernetes deployment. Less useful if your data pipeline is already a bespoke, single-purpose script you are happy to maintain.
