A 200-operator blender for foundation-model data
Data-Juicer turns raw multimodal chaos into training-ready datasets with YAML recipes instead of glue code.

What it does
Data-Juicer is a Python framework for cleaning, synthesizing, and analyzing data across the full AI lifecycle: pre-training, fine-tuning, RL, RAG, and agent traces. You chain 200+ built-in operators (text, image, audio, video, multimodal) via YAML “recipes” or compose them directly in Python. It runs on a laptop or scales to thousand-node Ray clusters.
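To make the recipe idea concrete, here is a minimal sketch in plain Python of a declarative pipeline: an ordered list of operator names with parameters, resolved against a registry and applied in sequence. It does not use Data-Juicer's API. The operator names (`lowercase_mapper`, `min_length_filter`), the `REGISTRY` dict, and `run_recipe` are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, Iterator

Sample = dict  # one record, e.g. {"text": "..."}


def lowercase_mapper(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Hypothetical mapper: returns a lowercased copy of each sample."""
    for s in samples:
        yield {**s, "text": s["text"].lower()}


def min_length_filter(samples: Iterable[Sample], min_len: int = 1) -> Iterator[Sample]:
    """Hypothetical filter: keeps samples whose text has at least min_len characters."""
    for s in samples:
        if len(s["text"]) >= min_len:
            yield s


# Hypothetical registry mapping recipe names to operator functions.
REGISTRY: dict[str, Callable[..., Iterator[Sample]]] = {
    "lowercase_mapper": lowercase_mapper,
    "min_length_filter": min_length_filter,
}


def run_recipe(recipe: list[dict], samples: Iterable[Sample]) -> list[Sample]:
    """Apply each step of the recipe, in order, to a lazy stream of samples."""
    stream: Iterable[Sample] = iter(samples)
    for step in recipe:
        # Each step is a single-key dict: {operator_name: {param: value}}.
        name, params = next(iter(step.items()))
        stream = REGISTRY[name](stream, **params)
    return list(stream)


# The recipe is plain data, so it could equally be loaded from a YAML file.
recipe = [
    {"lowercase_mapper": {}},
    {"min_length_filter": {"min_len": 5}},
]
data = [{"text": "Hello World"}, {"text": "Hi"}]
result = run_recipe(recipe, data)
```

Because the recipe is data rather than code, it can be diffed, versioned, and shared independently of the operators that execute it, which is the property the YAML recipes are meant to provide.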
The interesting bit
The project treats data processing as composable infrastructure: versionable, forkable pipelines rather than one-off scripts. It also bakes in the unglamorous but critical pieces, namely automatic operator fusion (a claimed 2-10× speedup), hot-reload for iteration, and built-in tracing for debugging distributed runs.
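One common form of operator fusion is to merge filters that need the same intermediate result, so that result is computed once per sample instead of once per filter. The sketch below illustrates that general idea only; it is not Data-Juicer's implementation, and `tokenize`, the two filter predicates, and the call counter are hypothetical names introduced for the example.

```python
# Counts how often the (pretend-expensive) shared step runs.
calls = {"tokenize": 0}


def tokenize(text: str) -> list[str]:
    """Shared intermediate step that both filters depend on."""
    calls["tokenize"] += 1
    return text.split()


def word_count_ok(words: list[str], min_words: int = 3) -> bool:
    return len(words) >= min_words


def avg_word_len_ok(words: list[str], max_avg: float = 10.0) -> bool:
    return bool(words) and sum(map(len, words)) / len(words) <= max_avg


def run_unfused(texts: list[str]) -> list[str]:
    """Two separate passes; each filter tokenizes on its own."""
    kept = [t for t in texts if word_count_ok(tokenize(t))]
    return [t for t in kept if avg_word_len_ok(tokenize(t))]


def run_fused(texts: list[str]) -> list[str]:
    """One pass; tokens are computed once and shared by both filters."""
    kept = []
    for t in texts:
        words = tokenize(t)
        if word_count_ok(words) and avg_word_len_ok(words):
            kept.append(t)
    return kept


texts = [
    "a short clean sentence",
    "two words",
    "supercalifragilisticexpialidocious pneumonoultramicroscopic antidisestablishmentarianism",
]

calls["tokenize"] = 0
unfused = run_unfused(texts)
unfused_calls = calls["tokenize"]

calls["tokenize"] = 0
fused = run_fused(texts)
fused_calls = calls["tokenize"]
```

Both versions keep the same samples, but the fused version tokenizes each text exactly once. The saving grows with the cost of the shared step and the number of filters that depend on it.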
Key highlights
- 200+ operators covering LaTeX parsing, video undistortion, 3D body mesh recovery, semantic chunking, deduplication, and more
- Recipe-first: declarative YAML pipelines you can git-version and share
- Performance claims: 70B samples in 2h on 50 Ray nodes (6400 cores); 5TB deduplicated in 2.8h on 1280 cores
- Deep ecosystem integration: Alibaba Cloud PAI, Hugging Face, Delta Lake, Iceberg, NeMo, LLaMA-Factory, and others
- NeurIPS'25 Spotlight for Data-Juicer 2.0; active academic and industry adoption (ByteDance, NVIDIA, Xiaomi, Tsinghua, etc.)
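For a sense of scale, the performance claims above can be turned into rough per-core rates. The sketch assumes "70B" means 70 billion samples, that 5TB means 5 × 10^12 bytes, and that work divides evenly across cores; none of these is confirmed by the README, and `per_core_rate` is a helper introduced here.

```python
SECONDS_PER_HOUR = 3600


def per_core_rate(total: float, hours: float, cores: int) -> float:
    """Average units processed per core per second, assuming an even split."""
    return total / (hours * SECONDS_PER_HOUR * cores)


# Claim 1: 70B samples in 2h on 6400 cores (assumes B = 10**9).
samples_per_core_s = per_core_rate(70e9, 2, 6400)

# Claim 2: 5TB deduplicated in 2.8h on 1280 cores (assumes TB = 10**12 bytes).
mb_per_core_s = per_core_rate(5e12, 2.8, 1280) / 1e6

print(f"{samples_per_core_s:.0f} samples/s per core")
print(f"{mb_per_core_s:.2f} MB/s per core")
```

Under those assumptions the figures work out to roughly 1,500 samples per second per core and a little under 0.4 MB per second per core. These are averages only; they say nothing about sample size, operator mix, or I/O setup.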
Caveats
- The README’s performance numbers lack reproducible benchmarks or links to detailed methodology
- “200+ operators” and “50+ recipes” are impressive but also imply a learning curve; expect to dig through the docs to find the right operator for a given job
- Some advanced features (Ray vLLM pipelines, embodied-AI video ops) appear to require specific hardware/software stacks not fully detailed in the truncated README
Verdict
Worth a look if you’re building or curating large multimodal datasets for LLMs or VLMs and want to escape ad-hoc preprocessing scripts. Probably overkill if your data fits in a Pandas DataFrame and your pipeline is three filters.