kubeflow/mpi-operator

Kubernetes learns MPI: distributed training without the scheduling headache

A Kubernetes operator that turns allreduce-style distributed training into a declarative YAML file, handling the messy pod orchestration so you don't have to.

Velocity (7d): +0.2 ★/day · Trend: steady

What it does

The MPI Operator is a Kubernetes controller that manages MPIJob custom resources. You write a YAML manifest with launcher and worker specs; the operator spins up the pods, wires them together, and runs your mpirun command. It targets the classic MPI pattern, in which one launcher orchestrates multiple workers for allreduce-style distributed training. That pattern is common in TensorFlow, PyTorch, and Horovod workflows.
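As a rough illustration of the launcher/worker shape described above, the Python sketch below builds an MPIJob manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f -` accepts just like YAML. The `kubeflow.org/v2beta1` API version, the placement of `cleanPodPolicy` under `runPolicy`, the image name, the training command, and the `build_mpijob` helper are all assumptions made for illustration; check the field names against the CRD installed in your cluster.

```python
import json


def build_mpijob(name, image, workers, command, gpus_per_worker=1):
    """Return an MPIJob manifest as a dict (hypothetical helper, not part of the operator)."""

    def pod_template(container_name, extra=None):
        # Minimal pod template with a single container.
        container = {"name": container_name, "image": image}
        container.update(extra or {})
        return {"spec": {"containers": [container]}}

    return {
        "apiVersion": "kubeflow.org/v2beta1",  # assumed API version
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": gpus_per_worker,
            "runPolicy": {"cleanPodPolicy": "Running"},  # assumed field placement
            "mpiReplicaSpecs": {
                # One launcher runs mpirun; the workers wait for it to connect.
                "Launcher": {
                    "replicas": 1,
                    "template": pod_template("launcher", {"command": command}),
                },
                "Worker": {
                    "replicas": workers,
                    "template": pod_template(
                        "worker",
                        {"resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}}},
                    ),
                },
            },
        },
    }


manifest = build_mpijob(
    name="example-train",              # hypothetical job name
    image="example.com/train:latest",  # hypothetical image
    workers=2,
    command=["mpirun", "-n", "2", "python", "train.py"],
)
print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` would submit the job, assuming the operator and its CRD are already installed.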

The interesting bit

The operator abstracts away the tedious parts of MPI on Kubernetes: SSH key distribution, hostfile generation, and ensuring workers are ready before the launcher fires. It also exposes Prometheus metrics for job lifecycle events, so you can track created, successful, and failed jobs without building your own instrumentation.
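To make "hostfile generation" concrete, here is a minimal sketch of the kind of file an MPI launcher reads. It is not the operator's code: the `<job>-worker-<index>` host naming and the `render_hostfile` helper are assumptions for illustration. The two line formats are the standard ones, `host slots=N` for Open MPI and `host:N` for Hydra-based launchers such as MPICH and Intel MPI.

```python
def render_hostfile(job_name, workers, slots_per_worker, implementation="openmpi"):
    """Render one hostfile line per worker (illustrative; the naming scheme is assumed)."""
    lines = []
    for index in range(workers):
        host = f"{job_name}-worker-{index}"  # assumed pod naming scheme
        if implementation == "openmpi":
            # Open MPI hostfile syntax: "<host> slots=<count>"
            lines.append(f"{host} slots={slots_per_worker}")
        else:
            # Hydra-style syntax used by MPICH and Intel MPI: "<host>:<count>"
            lines.append(f"{host}:{slots_per_worker}")
    return "\n".join(lines) + "\n"


print(render_hostfile("example-train", workers=2, slots_per_worker=1), end="")
```

Doing this by hand means regenerating the file whenever the worker count changes, which is the chore the operator takes over.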

Key highlights

  • Supports multiple MPI implementations: Open MPI, Intel MPI, and MPICH
  • GPU-aware scheduling via standard Kubernetes resource limits (nvidia.com/gpu)
  • Configurable cleanPodPolicy for pod cleanup behavior after job completion
  • Prometheus metrics exposed for job tracking and kube-state-metrics integration
  • Part of the broader Kubeflow ecosystem, with installation via raw manifests or kustomize overlays
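For the Prometheus highlight, the sketch below shows one way to turn the job counters into a failure rate. The metric names (`mpi_operator_jobs_created_total` and friends) are assumed from memory of the project README, and the sample values are made up, so treat both as placeholders and confirm them against your operator's `/metrics` endpoint.

```python
# Made-up sample in the Prometheus text exposition format; metric names are assumed.
SAMPLE = """\
# TYPE mpi_operator_jobs_created_total counter
mpi_operator_jobs_created_total 12
mpi_operator_jobs_successful_total 9
mpi_operator_jobs_failed_total 2
"""


def parse_counters(text):
    """Parse 'name value' sample lines, skipping comments (no timestamp support)."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        name, _, value = line.rpartition(" ")
        counters[name] = float(value)
    return counters


counters = parse_counters(SAMPLE)
failed = counters["mpi_operator_jobs_failed_total"]
finished = counters["mpi_operator_jobs_successful_total"] + failed
failure_rate = failed / finished if finished else 0.0
print(f"failure rate: {failure_rate:.2f}")
```

In practice you would let Prometheus scrape the endpoint and compute this with a query; the parser here only shows what the raw counters look like.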

Caveats

  • Some README examples still reference older API versions and tooling (e.g., kubectl kustomize rather than kubectl apply -k), so copied commands may need adjusting for current clusters
  • Documentation on advanced scheduling, fault tolerance, or gang scheduling is thin in the README itself
  • The project has modest adoption (528 stars) relative to the broader Kubeflow ecosystem

Verdict

Worth a look if you’re already running Kubernetes and want to run Horovod-style distributed training without hand-rolling MPI infrastructure. Skip it if you’re on a managed ML platform (SageMaker, Vertex AI, etc.) or if your workloads don’t fit the launcher-worker pattern.
