intelligent-machine-learning/dlrover
An automatic distributed deep learning system that orchestrates large AI model training on Kubernetes and Ray clusters.

Velocity (7d): +1.2 ★/day · Trend: steady
DLRover automates distributed deep learning training workflows, handling resource orchestration, fault tolerance, and auto-scaling for large AI models. It integrates with Kubernetes and Ray to manage training jobs and supports features such as Flash Checkpoint for rapid failure recovery and XPU Timer for runtime diagnostics. By abstracting away the complexities of distributed training, the system lets model developers focus on model architecture.