tensorflow/mesh

TensorFlow's answer to 'where does my 5B-parameter model fit?'

A Python layer that lets you describe distributed training by naming tensor dimensions and mapping them to processor grids, then mechanically lowers the whole thing into TensorFlow.

Velocity (7d): +0.6 ★/day · Trend: steady

What it does

Mesh TensorFlow is a Python library that sits on top of TensorFlow and lets you write a model once, then decide later how to slice it across processors. You build an mtf.Graph with named dimensions—("batch", 100), ("hidden", 1024)—and define layout rules like ("batch", "processor_rows"). A Lowering pass turns that abstract graph into concrete TensorFlow ops, inserting all-reduce communication where dimensions are split and need reconciling. It handles data parallelism, model parallelism, or both tiled together on an n-dimensional mesh of processors.
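To make the naming-and-mapping idea concrete, here is a minimal pure-Python sketch of how layout rules determine the slice of a tensor each processor holds. It is not the Mesh TensorFlow API: `local_shape` is a hypothetical helper written for illustration, the `"io"` dimension is invented, and the divisibility check is an assumption of this sketch. The rule that two dimensions of one tensor cannot share a mesh axis mirrors the library's documented restriction.

```python
def local_shape(tensor_shape, mesh_shape, layout_rules):
    """Return the per-processor slice shape of a named tensor.

    tensor_shape: list of (dimension_name, size) pairs
    mesh_shape:   list of (mesh_axis_name, size) pairs
    layout_rules: list of (dimension_name, mesh_axis_name) pairs
    """
    mesh_sizes = dict(mesh_shape)
    rules = dict(layout_rules)
    used_axes = set()
    result = []
    for name, size in tensor_shape:
        axis = rules.get(name)
        if axis is None:
            # No rule for this dimension: it is replicated, not split.
            result.append((name, size))
            continue
        if axis in used_axes:
            raise ValueError(f"mesh axis {axis!r} used twice in one tensor")
        used_axes.add(axis)
        parts = mesh_sizes[axis]
        if size % parts != 0:
            # Assumption of this sketch: sizes must divide evenly.
            raise ValueError(f"{name}={size} is not divisible by {parts}")
        result.append((name, size // parts))
    return result


# A 2x2 processor grid with batch split over rows, hidden over columns.
mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = [("batch", "processor_rows"), ("hidden", "processor_cols")]

activations = [("batch", 100), ("hidden", 1024)]
weights = [("io", 784), ("hidden", 1024)]  # "io" is a made-up dimension name

print(local_shape(activations, mesh_shape, layout_rules))
print(local_shape(weights, mesh_shape, layout_rules))
```

Under these rules each processor holds a 50 x 512 slice of the activations, while the weights are split only along `hidden` and replicated across processor rows.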

The interesting bit

The layout is purely a performance knob: the README explicitly states that “layouts do not affect results—only performance.” This means you can experiment with distribution strategies without touching model code. There’s even an auto_mtf subpackage that tries to pick a layout for you, which is the rare kind of automation that doesn’t pretend to be magic—it just solves an ILP based on your graph and mesh shape.
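The "layouts do not affect results" claim can be checked on a toy matrix multiply. The sketch below simulates two layouts in plain Python: one splits the batch dimension, the other splits the contracted hidden dimension and sums the partial products, which stands in for the all-reduce the lowering would insert. All function names here are hypothetical, and integers are used so the comparison is exact.

```python
def matmul(x, w):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_batch_matmul(x, w, parts):
    """Each simulated processor handles a slice of the batch; concatenate."""
    chunk = len(x) // parts
    out = []
    for p in range(parts):
        out.extend(matmul(x[p * chunk:(p + 1) * chunk], w))
    return out


def split_hidden_matmul(x, w, parts):
    """Each simulated processor handles a slice of the contracted dimension.

    The partial products are summed elementwise, which plays the role of
    an all-reduce across the processors.
    """
    chunk = len(w) // parts
    total = None
    for p in range(parts):
        lo, hi = p * chunk, (p + 1) * chunk
        partial = matmul([row[lo:hi] for row in x], w[lo:hi])
        if total is None:
            total = partial
        else:
            total = [[a + b for a, b in zip(r1, r2)]
                     for r1, r2 in zip(total, partial)]
    return total


x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # batch=2, hidden=4
w = [[1, 0], [0, 1], [2, 1], [1, 3]]    # hidden=4, out=2

reference = matmul(x, w)
assert split_batch_matmul(x, w, parts=2) == reference
assert split_hidden_matmul(x, w, parts=2) == reference
print(reference)  # [[11, 17], [27, 37]]
```

Both layouts reproduce the unsplit result; what changes is which processor does which part of the work and how much data has to move between them.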

Key highlights

  • Named dimensions (batch, hidden, rows) are the core abstraction; tensor shapes are tuples of (name, size) pairs, not anonymous axes.
  • A Mesh is an n-dimensional array of processors; tensors are split and/or replicated across it according to global layout rules.
  • Supports hybrid parallelism out of the box—e.g., split batch across rows of GPUs and hidden units across columns.
  • Includes auto_mtf for automatic layout selection; manual tuning is still possible for “advanced users” who want to “eke out additional performance.”
  • Implemented as a layer over TensorFlow, generating standard TF graph operations plus collective communication.

Caveats

  • The README carries a blunt disclaimer: “If you just want data-parallel training (batch-splitting), then you do not need Mesh TensorFlow.”
  • Version is v0.0, and there’s a visible TODO(noam): verify that this code works in the MNIST example.
  • Requires careful attention to chunk sizes; the docs warn that if the per-processor splits are too small, training becomes bound by communication bandwidth rather than compute.
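A back-of-envelope sketch of why small chunks hurt, under a deliberately simplified model that is an assumption here and not taken from the project: a single dense layer trained with batch splitting, where each processor runs a forward matmul on its local batch and the weight gradient (one value per weight) is all-reduced across processors. The function name and the layer sizes are hypothetical.

```python
def flops_per_allreduced_element(local_batch, d_in, d_out):
    """Compute done per gradient element communicated, in a toy model.

    Assumptions (illustrative only):
      - forward matmul [local_batch, d_in] x [d_in, d_out] costs
        2 * local_batch * d_in * d_out floating-point operations
      - the all-reduce moves one value per weight: d_in * d_out elements
    """
    matmul_flops = 2 * local_batch * d_in * d_out
    gradient_elements = d_in * d_out
    return matmul_flops / gradient_elements


global_batch = 512
for processors in (1, 8, 64):
    local_batch = global_batch // processors
    ratio = flops_per_allreduced_element(local_batch, d_in=1024, d_out=4096)
    print(processors, local_batch, ratio)
```

In this model the ratio works out to twice the local batch size, so splitting a fixed global batch across more processors shrinks the compute available to hide each unit of communication. Real workloads add backward passes, overlap, and topology effects, so treat the numbers as directional only.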

Verdict

Worth a look if you’re training models that don’t fit on one GPU or TPU—especially large language models or 3D image networks—and you want to reason about parallelism in terms of named axes rather than manual device placement. Skip it if standard data parallelism already handles your model size; the abstraction has overhead and the project is explicitly overkill for that case.
