Is mpi-operator open source?

Yes — kubeflow/mpi-operator is open source, released under the Apache-2.0 license.

What language is mpi-operator written in?

kubeflow/mpi-operator is primarily written in Go.

How popular is mpi-operator?

kubeflow/mpi-operator has 530 stars on GitHub.

Where can I find mpi-operator?

kubeflow/mpi-operator is on GitHub at https://github.com/kubeflow/mpi-operator.

kubeflow/mpi-operator

Kubernetes learns MPI: distributed training without the scheduling headache

A Kubernetes operator that turns allreduce-style distributed training into a declarative YAML file, handling the messy pod orchestration so you don't have to.

★530 stars Go ML Frameworks LLMOps · Eval

What it does

The MPI Operator is a Kubernetes controller that manages MPIJob custom resources. You define a YAML manifest with launcher and worker specs; the operator spins up the pods, wires them together, and runs your mpirun command. It targets the classic MPI pattern (one launcher orchestrating multiple workers for allreduce-style distributed training), which is common in TensorFlow, PyTorch, and Horovod workflows.
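
To make the launcher/worker split concrete, here is a minimal sketch that builds an MPIJob manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The helper `build_mpijob`, the image name, and the `train.py` command are hypothetical. The field names (`slotsPerWorker`, `runPolicy.cleanPodPolicy`, `mpiReplicaSpecs`) follow the `kubeflow.org/v2beta1` API as commonly documented, so check them against the version installed in your cluster.

```python
import json


def build_mpijob(name, image, command, workers=2, slots_per_worker=1,
                 gpus_per_worker=0, clean_pod_policy="Running"):
    """Return an MPIJob manifest as a plain dict (hypothetical helper)."""

    def pod_template(container_name, extra=None):
        # Every pod in this sketch runs a single container from the same image.
        container = {"name": container_name, "image": image}
        if extra:
            container.update(extra)
        return {"spec": {"containers": [container]}}

    worker_extra = {}
    if gpus_per_worker > 0:
        # GPUs are requested through ordinary Kubernetes resource limits.
        worker_extra["resources"] = {"limits": {"nvidia.com/gpu": gpus_per_worker}}

    return {
        "apiVersion": "kubeflow.org/v2beta1",  # assumed API version
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": slots_per_worker,
            "runPolicy": {"cleanPodPolicy": clean_pod_policy},
            "mpiReplicaSpecs": {
                # One launcher runs mpirun; the workers only host MPI ranks.
                "Launcher": {
                    "replicas": 1,
                    "template": pod_template("launcher", {"command": command}),
                },
                "Worker": {
                    "replicas": workers,
                    "template": pod_template("worker", worker_extra),
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_mpijob(
        name="demo-train",
        image="example.com/horovod-train:latest",  # placeholder image
        command=["mpirun", "-n", "4", "python", "train.py"],  # placeholder command
        workers=2,
        slots_per_worker=2,
        gpus_per_worker=2,
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`. The rank count passed to mpirun (4 here) is simply workers times slots per worker.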

The interesting bit

The operator abstracts away the tedious parts of MPI on Kubernetes: SSH key distribution, hostfile generation, and ensuring workers are ready before the launcher fires. It also exposes Prometheus metrics for job lifecycle events, so you can track created, successful, and failed jobs without building your own instrumentation.
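
As a rough illustration of the hostfile step, the sketch below renders an Open MPI style hostfile, where each line is a hostname followed by `slots=N`. The function `render_hostfile` and the `<job>-worker-<index>` hostname scheme are assumptions made for illustration. This is not the operator's actual code, and other MPI implementations use a different hostfile syntax.

```python
def render_hostfile(job_name, workers, slots_per_worker=1):
    """Render an Open MPI style hostfile (hypothetical helper).

    Hostnames follow an assumed "<job>-worker-<index>" scheme; the real
    operator may use fully qualified service names instead.
    """
    if workers < 1 or slots_per_worker < 1:
        raise ValueError("workers and slots_per_worker must be positive")
    lines = [
        f"{job_name}-worker-{index} slots={slots_per_worker}"
        for index in range(workers)
    ]
    # One line per worker, terminated by a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_hostfile("demo-train", workers=2, slots_per_worker=2), end="")
```

With a file like this in place, mpirun can spread ranks across the listed workers, which is why the launcher has to wait until every worker is reachable.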

Key highlights

Supports multiple MPI implementations: Open MPI, Intel MPI, and MPICH
GPU-aware scheduling via standard Kubernetes resource limits (nvidia.com/gpu)
Configurable cleanPodPolicy for pod cleanup behavior after job completion
Prometheus metrics exposed for job tracking and kube-state-metrics integration
Part of the broader Kubeflow ecosystem, with installation via raw manifests or kustomize overlays

Caveats

The README examples still reference older API versions and Kubernetes features (e.g., kubectl kustomize vs. kubectl apply -k); some copy-paste may need adjustment for modern clusters
Documentation on advanced scheduling, fault tolerance, or gang scheduling is thin in the README itself
The project has modest adoption (530 stars) relative to the broader Kubeflow ecosystem

Verdict

Worth a look if you’re already running Kubernetes and want to run Horovod-style distributed training without hand-rolling MPI infrastructure. Skip it if you’re on a managed ML platform (SageMaker, Vertex AI, etc.) or if your workloads don’t fit the launcher-worker pattern.
