kubeflow/trainer

Kubernetes that finally speaks GPU

A Kubernetes-native operator that turns multi-node, multi-GPU training jobs from a scheduling nightmare into a declarative YAML file.

Velocity (7d): +0.6 ★ / day · Trend: steady

What it does

Kubeflow Trainer is a Kubernetes operator for distributed AI training and LLM fine-tuning. You define a TrainJob with a runtime (PyTorch, JAX, XGBoost, MPI, etc.) and the controller handles the pod topology, GPU placement, and inter-node communication. It also includes a distributed data cache for zero-copy data streaming to GPU nodes.
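To make "define a TrainJob with a runtime" concrete, here is a minimal sketch that builds such a manifest as a plain Python dict and prints it as JSON, which `kubectl apply -f` accepts. The field layout (`runtimeRef`, `trainer.numNodes`, `resourcesPerNode`) and the `v1alpha1` API version are assumptions based on the alpha API and should be checked against the CRD installed in your cluster. The helper `build_trainjob`, the job name, the runtime name, and the image are hypothetical.

```python
import json


def build_trainjob(name, runtime, num_nodes, image, command, gpus_per_node=1):
    """Return a TrainJob manifest as a dict.

    Field names are assumed from the alpha API and may differ in your cluster.
    """
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed alpha version
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Selects the runtime (framework blueprint) the job runs on.
            "runtimeRef": {"name": runtime},
            "trainer": {
                "numNodes": num_nodes,
                "image": image,
                "command": list(command),
                # Standard Kubernetes resource syntax for GPUs per node.
                "resourcesPerNode": {
                    "limits": {"nvidia.com/gpu": str(gpus_per_node)}
                },
            },
        },
    }


# Hypothetical values, for illustration only.
job = build_trainjob(
    name="demo-finetune",
    runtime="torch-distributed",
    num_nodes=2,
    image="example.com/train:latest",
    command=["python", "train.py"],
    gpus_per_node=4,
)
print(json.dumps(job, indent=2))
```

Under these assumptions, switching frameworks means changing only the `runtime` argument, which is the point of the single-CRD design.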

The interesting bit

The project merged several older Kubeflow operators (PyTorch, MPI, XGBoost) into a single unified API, then layered on HPC-grade MPI orchestration and topology-aware scheduling via Kueue. It’s essentially trying to be the “one CRD to rule them all” for ML training on Kubernetes — a notoriously crowded space where most tools pick one framework and call it done.

Key highlights

  • Supports PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and MPI/Flux Framework runtimes
  • Integrates with Kueue for topology-aware scheduling and multi-cluster job dispatching
  • Distributed data cache with zero-copy transfer to GPU nodes
  • Python SDK (TrainJob and Runtime APIs) for practitioners who’d rather not hand-write YAML
  • Official PyTorch ecosystem project since July 2025
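
The Kueue integration in the list above usually comes down to one piece of metadata: Kueue picks up workloads through the `kueue.x-k8s.io/queue-name` label. The sketch below adds that label to a manifest dict without mutating the original. Whether your installed Kueue version recognizes TrainJob objects is an assumption to verify. The helper `with_queue`, the queue name, and the stub manifest are hypothetical.

```python
import copy

# Label Kueue uses to assign a workload to a LocalQueue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_queue(manifest, queue_name):
    """Return a copy of the manifest labeled for the given Kueue queue."""
    out = copy.deepcopy(manifest)
    labels = out.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = queue_name
    return out


# Stub manifest for illustration; the apiVersion is an assumed alpha version.
base = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "demo-finetune"},
}
queued = with_queue(base, "team-a-queue")
print(queued["metadata"]["labels"])
```

Returning a copy keeps the unlabeled manifest reusable, for example to submit the same job to a different queue.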

Caveats

  • APIs are alpha and may change; users of the legacy v1 Training Operator need to migrate to the new TrainJob API
  • The README’s “seamlessly integrates” and “effortlessly develop” claims are marketing seasoning — actual complexity depends on your cluster setup

Verdict

Worth evaluating if you’re already running Kubernetes at scale and tired of juggling separate operators per framework. Skip it if you’re on managed training platforms (SageMaker, Vertex, etc.) or need API stability guarantees today.
