LLM data janitor with a visual drag-and-drop conscience
DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.

What it does
DataFlow is a Python framework for generating, cleaning, and filtering data for LLM training. It ingests noisy sources (PDFs, plain text, low-quality QA) and runs them through reusable “operators” (think: data processing layers) to produce structured training datasets. It also ships with a WebUI for building pipelines visually, started with the dataflow webui command.
The interesting bit
The project borrows its mental model from PyTorch: a clear Pipeline → Operator → Prompt hierarchy that makes it easy to swap models, compare data governance strategies, and publish reproducible workflows. There’s even a DataFlow-Agent that dynamically assembles pipelines from high-level intent, which is either genuinely useful or a very committed side quest.
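To make the hierarchy concrete, here is a minimal Python sketch of the Pipeline → Operator → Prompt idea. It is not DataFlow's API: every name below (Prompt, Operator, MinLengthFilter, QuestionGenerator, Pipeline, stub_llm) is hypothetical, and the LLM is replaced by a stub function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]
LLM = Callable[[str], str]  # any function mapping a prompt string to a completion


@dataclass
class Prompt:
    """Hypothetical prompt template (not a DataFlow class)."""
    template: str

    def render(self, row: Row) -> str:
        # Fill the template's {placeholders} from the row's fields.
        return self.template.format(**row)


class Operator:
    """Hypothetical base class: takes a list of rows, returns a list of rows."""

    def run(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class MinLengthFilter(Operator):
    """Filtering operator: drops rows whose text field is too short."""

    def __init__(self, key: str, min_chars: int):
        self.key = key
        self.min_chars = min_chars

    def run(self, rows: List[Row]) -> List[Row]:
        return [r for r in rows if len(r[self.key].strip()) >= self.min_chars]


class QuestionGenerator(Operator):
    """Generation operator: adds a 'question' field produced by the LLM."""

    def __init__(self, llm: LLM, prompt: Prompt):
        self.llm = llm
        self.prompt = prompt

    def run(self, rows: List[Row]) -> List[Row]:
        return [{**r, "question": self.llm(self.prompt.render(r))} for r in rows]


class Pipeline:
    """Runs operators in order, like layers in a sequential model."""

    def __init__(self, operators: List[Operator]):
        self.operators = operators

    def forward(self, rows: List[Row]) -> List[Row]:
        for op in self.operators:
            rows = op.run(rows)
        return rows


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a deterministic placeholder."""
    return f"stub question ({len(prompt)} prompt chars)"


pipeline = Pipeline([
    MinLengthFilter(key="text", min_chars=10),
    QuestionGenerator(
        llm=stub_llm,
        prompt=Prompt("Write one question answered by this passage:\n{text}"),
    ),
])

rows = [
    {"text": "ok"},  # too short, removed by the filter
    {"text": "Ray schedules tasks across a cluster."},
]
out = pipeline.forward(rows)
print(out)
```

In this sketch, swapping models means passing a different callable as llm, and swapping prompts means passing a different Prompt. The real framework presumably does much more, but the separation of concerns has the same shape.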
Key highlights
- 100+ pre-built operators for generation, evaluation, filtering, and refinement
- Ready-made pipelines for text, math, code, PDF→QA, and Text2SQL workflows
- WebUI with drag-and-drop operator composition
- Distributed execution via RayOrch for large-scale jobs
- Docker-ready, Colab notebooks, and PyPI install (open-dataflow)
- Academic backing: ICDE 2026 and KDD 2026 accepted papers
Caveats
- The README is enthusiastic but gives few concrete performance numbers or benchmark comparisons against NeMo-Curator and Data-Juicer
- Several ecosystem components (DataFlow-Agent, DataFlow-MM) live in separate repos, so the “unified” experience requires some assembly
- Documentation links in the README appear truncated or broken in places
Verdict
Worth a look if you’re building domain-specific LLMs and need a structured, shareable approach to data curation. Skip it if you just need quick one-off cleaning: this is infrastructure, not a script.