flexflow/flexflow-train

A framework that stops you from hand-tuning GPU parallelization

FlexFlow Train searches for fast distributed DNN training strategies so you don't have to guess at data, model, or parameter parallelism.

1.9k stars · C++ · ML Frameworks · LLMOps · Eval
Velocity (7d): +0.7 ★/day · Trend: steady

What it does

FlexFlow Train is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies. It explores combinations of data, model, and parameter parallelism using Markov chain Monte Carlo (MCMC) search, then applies the best strategy it finds to your training run.
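To make the idea of "parallelization as MCMC search" concrete, here is a toy sketch in Python. It is not FlexFlow's code or API: the cost model, the `MODES` list, and the names `toy_cost` and `mcmc_search` are all invented for illustration, and a real system would score each candidate with a simulator or profiler rather than a hand-written formula.

```python
import math
import random

# Hypothetical setup: every operator in the model picks one parallelism mode.
MODES = ("data", "model", "parameter")


def toy_cost(strategy):
    """Made-up cost model standing in for a real execution simulator."""
    per_op = {"data": 3.0, "model": 2.0, "parameter": 2.5}
    compute = sum(per_op[mode] for mode in strategy)
    # Charge for every mode switch between neighbouring operators,
    # as a crude stand-in for data movement between layouts.
    switches = sum(1 for a, b in zip(strategy, strategy[1:]) if a != b)
    return compute + 1.5 * switches


def mcmc_search(num_ops, budget, beta=1.0, seed=0):
    """Metropolis-style search: `budget` is the number of proposals tried."""
    rng = random.Random(seed)
    current = ["data"] * num_ops  # start from plain data parallelism
    current_cost = toy_cost(current)
    best, best_cost = list(current), current_cost
    for _ in range(budget):
        # Propose a neighbour: re-pick the mode of one random operator.
        proposal = list(current)
        proposal[rng.randrange(num_ops)] = rng.choice(MODES)
        cost = toy_cost(proposal)
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the regression grows.
        if cost <= current_cost or rng.random() < math.exp(beta * (current_cost - cost)):
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = list(proposal), cost
    return best, best_cost


if __name__ == "__main__":
    strategy, cost = mcmc_search(num_ops=4, budget=500)
    print(strategy, cost)
```

Accepting an occasional worse proposal is what lets the search climb out of local minima, and the budget caps how long you are willing to wait for that to pay off.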

The interesting bit

The project treats parallelization as a search problem rather than a configuration headache. It can export and import strategies, so you pay the search cost once and reuse the result, a practical concession to the reality that autotuning is expensive.
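The "search once, reuse later" pattern can be sketched as a small cache around whatever search routine you run. This is an assumption-laden illustration, not FlexFlow's export format: the JSON file layout, the `load_or_search` helper, and the `search_fn` callback are all hypothetical.

```python
import json
import os


def load_or_search(path, search_fn):
    """Return (strategy, was_cached).

    If `path` exists, load the saved strategy and skip the search.
    Otherwise run `search_fn` once and save its result for next time.
    The JSON layout here is hypothetical, not FlexFlow's own format.
    """
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True
    strategy = search_fn()
    with open(path, "w") as f:
        json.dump(strategy, f)
    return strategy, False
```

The first call pays for the search and writes the file; later runs with the same model and cluster can load the result directly. A saved strategy is only as good as its match to the hardware it was searched on, so it is worth keying the file name on both.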

Key highlights

  • Built on the Legion runtime; targets multi-GPU and multi-node setups
  • Supports PyTorch model import via torch_to_flexflow, plus Keras and ONNX frontends
  • C++ and Python APIs available
  • Search budget and parallelism modes are configurable via CLI flags
  • Backed by OSDI ’22 and MLSys ’19 research papers

Caveats

  • The README is currently stripped down; installation instructions, Docker details, and most usage examples are commented out or missing
  • The repository recently split from flexflow-serve, so docs and links may still be in flux
  • Pre-built Docker containers were noted as CUDA 11.7-only, but this detail is currently hidden in comments

Verdict

Worth a look if you’re running large-scale training and suspect your current parallelization strategy is leaving GPU cycles on the table. Skip it if you want a polished, batteries-included framework: this is research infrastructure with academic roots.
