OpenDCAI/DataFlow

LLM data janitor with a visual drag-and-drop conscience

DataFlow turns messy PDFs and raw text into training-ready datasets using composable LLM operators and a PyTorch-like pipeline API.

4.6k stars · Python · Data Tooling
Velocity (7d): +7.7 ★/day · Trend: steady

What it does

DataFlow is a Python framework for generating, cleaning, and filtering data for LLM training. It ingests noisy sources such as PDFs, plain text, and low-quality QA, then runs them through reusable “operators” (think: data processing layers) to produce structured training datasets. It also ships with a WebUI for visual pipeline building, launched with the dataflow webui command.

The interesting bit

The project borrows its mental model from PyTorch: a clear Pipeline → Operator → Prompt hierarchy that makes it easy to swap models, compare data governance strategies, and publish reproducible workflows. There’s even a DataFlow-Agent that dynamically assembles pipelines from high-level intent, which is either genuinely useful or a very committed side quest.
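
To make the Pipeline → Operator → Prompt layering concrete, here is a minimal sketch in plain Python. It does not use DataFlow's actual API: the class names, the forward method, and the fake_model stand-in are all assumptions made for illustration, chosen only to mirror the PyTorch-style composition described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Prompt:
    """Hypothetical prompt template (not DataFlow's class)."""
    template: str

    def render(self, record: Record) -> str:
        # Fill the template from the record's fields; extra fields are ignored.
        return self.template.format(**record)


@dataclass
class Operator:
    """Hypothetical operator: render a prompt, call a model, store the reply."""
    prompt: Prompt
    model: Callable[[str], str]  # any text-in, text-out callable
    output_key: str

    def run(self, records: List[Record]) -> List[Record]:
        # Return new records with the model output added under output_key.
        return [
            {**r, self.output_key: self.model(self.prompt.render(r))}
            for r in records
        ]


@dataclass
class Pipeline:
    """Hypothetical pipeline: apply operators in order, like stacked layers."""
    operators: List[Operator] = field(default_factory=list)

    def forward(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op.run(records)
        return records


def fake_model(text: str) -> str:
    """Stand-in for an LLM call: uppercases the prompt so the demo runs offline."""
    return text.upper()


pipeline = Pipeline([
    Operator(Prompt("Summarize: {text}"), fake_model, "summary"),
    Operator(Prompt("Write a question about: {summary}"), fake_model, "question"),
])
result = pipeline.forward([{"text": "ray scales jobs"}])
print(result[0]["question"])
```

Because each operator only depends on a text-in, text-out callable, swapping fake_model for a real client changes one argument and leaves the pipeline definition alone, which is roughly the "easy to swap models" property the hierarchy is meant to give.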

Key highlights

  • 100+ pre-built operators for generation, evaluation, filtering, and refinement
  • Ready-made pipelines for text, math, code, PDF→QA, and Text2SQL workflows
  • WebUI with drag-and-drop operator composition
  • Distributed execution via RayOrch for large-scale jobs
  • Docker-ready, Colab notebooks, and PyPI install (open-dataflow)
  • Academic backing: ICDE 2026 and KDD 2026 accepted papers

Caveats

  • The README is enthusiastic but vague on actual performance numbers or benchmark comparisons against Nemo-Curator and Data-Juicer
  • Several ecosystem components (DataFlow-Agent, DataFlow-MM) live in separate repos, so the “unified” experience requires some assembly
  • Documentation links in the README appear truncated or broken in places

Verdict

Worth a look if you’re building domain-specific LLMs and need a structured, shareable approach to data curation. Skip it if you just need quick one-off cleaning: this is infrastructure, not a script.
