volcano-sh/volcano
A Kubernetes-native batch scheduling system purpose-built for AI/ML/DL training and inference workloads.

Volcano extends Kubernetes scheduling to handle batch and elastic workloads from AI/ML, deep learning, bioinformatics, and big data frameworks. It provides job scheduling, resource management, and gang scheduling tuned for distributed training jobs built on TensorFlow, PyTorch, Ray, MPI, and similar frameworks. In practice it serves as the orchestration layer for managing GPU clusters and coordinating multi-node training workloads on Kubernetes.
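
Gang scheduling is the all-or-nothing idea behind multi-node training: either every worker of a job gets a slot, or none is started, so a partly placed job never holds GPUs while waiting for the rest. The toy Python sketch below illustrates only that idea. It is not Volcano's code or API; `GangJob`, `try_schedule_gang`, and the node names are hypothetical, and the first-fit placement is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class GangJob:
    # Hypothetical job description, not a Volcano type.
    name: str
    min_members: int   # pods that must all start together
    gpus_per_pod: int  # GPUs requested by each pod


def try_schedule_gang(job, free_gpus):
    """Place the whole gang or nothing.

    free_gpus maps node name -> free GPU count. It is updated only
    when every member of the gang fits; otherwise it is left untouched
    and None is returned.
    """
    tentative = dict(free_gpus)  # work on a copy so failure has no side effects
    placement = []
    for _ in range(job.min_members):
        # First-fit: take the first node with enough free GPUs (assumption).
        node = next(
            (n for n, free in tentative.items() if free >= job.gpus_per_pod),
            None,
        )
        if node is None:
            return None  # one member cannot be placed, so reject the whole gang
        tentative[node] -= job.gpus_per_pod
        placement.append(node)
    free_gpus.update(tentative)  # commit only after all members fit
    return placement


if __name__ == "__main__":
    nodes = {"node-a": 4, "node-b": 2}
    print(try_schedule_gang(GangJob("train", 3, 2), nodes))
    print(nodes)
```

With the sample cluster above, three 2-GPU workers fit (two on `node-a`, one on `node-b`), so the placement is committed. A job that needs more GPUs than remain returns `None` and leaves the free-GPU map unchanged.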