Microsoft's AI cluster manager is now in maintenance-only mode
OpenPAI was built to share GPU farms among teams, but the repo went read-only with v1.8.1.
What it does
OpenPAI is a Kubernetes-based platform for sharing AI compute resources — GPUs, FPGAs, InfiniBand — across teams. It wraps job scheduling, user management, storage, and pre-built Docker images for TensorFlow, PyTorch, and friends into a single deployable stack. Administrators manage nodes through a web portal and a paictl CLI; users submit training jobs without worrying about the hardware underneath.
The interesting bit
The project carries Microsoft’s “proven track record in large-scale production environment” — a rare claim of battle-tested lineage in open-source cluster tooling. It also shed its Hadoop YARN roots in v1.0, migrating fully to Kubernetes with a custom HiveD scheduler for GPU-aware placement.
Key highlights
- Supports on-premises, hybrid, cloud, or single-box deployment
- Modular architecture: marketplace, VS Code extension, SDK, and runtime are separate repos you can swap in or out
- Pre-built containers for popular frameworks, with support for distributed training
- Virtual clusters for multi-tenant resource isolation
- End-to-end manuals for both administrators and end users
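To make the virtual-cluster idea concrete, here is a minimal sketch of per-tenant GPU quotas. It is a toy model, not OpenPAI's code or the HiveD algorithm: the `VirtualCluster` class, its fields, and the team names are all hypothetical, and real placement also has to account for topology and node boundaries.

```python
from dataclasses import dataclass


@dataclass
class VirtualCluster:
    """Hypothetical model of one tenant's slice of a shared GPU pool."""

    name: str
    gpu_quota: int        # GPUs this tenant is entitled to
    gpus_in_use: int = 0  # GPUs currently held by the tenant's jobs

    def admit(self, gpus_requested: int) -> bool:
        """Admit a job only if it fits within this tenant's own quota."""
        if self.gpus_in_use + gpus_requested > self.gpu_quota:
            return False
        self.gpus_in_use += gpus_requested
        return True


# Two tenants with separate quotas (names and sizes are made up).
vision = VirtualCluster(name="vision", gpu_quota=8)
nlp = VirtualCluster(name="nlp", gpu_quota=4)

results = [
    vision.admit(6),  # fits: 6 of 8 GPUs
    vision.admit(4),  # rejected: would exceed vision's quota
    nlp.admit(4),     # fits: nlp's quota is unaffected by vision's usage
]
print(results)
```

The point of the isolation is visible in the last call: one tenant running into its limit does not eat into another tenant's allocation.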
Caveats
- The repository is read-only as of v1.8.1 (December 2021); no major features planned, and collaboration requires contacting repo admins directly
- The README’s upgrade table still lists v1.0.0 as “latest”, while the banner names v1.8.1 as the final release, so the documentation has visibly drifted
Verdict
Worth studying if you’re building an internal GPU-sharing platform and want to see how Microsoft solved user quotas, job orchestration, and framework containerization. Not worth adopting fresh unless you plan to fork and maintain it yourself.