← all repositories

intelligent-machine-learning/dlrover

An automatic distributed deep learning system that orchestrates large AI model training on Kubernetes and Ray clusters.

dlrover
Velocity · 7d
+1.2
★ / day
Trend
steady
star history

DLRover automates distributed deep learning training workflows, handling resource orchestration, fault-tolerance, and auto-scaling for large AI models. It integrates with Kubernetes and Ray to manage training jobs, supporting features like Flash Checkpoint for rapid failure recovery and XPU Timer for runtime diagnostics. The system enables model developers to focus on architecture while abstracting away distributed training complexities.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.