microsoft/pai

Microsoft's AI cluster manager is now in maintenance-only mode

OpenPAI was built to share GPU farms among teams, but the repo went read-only after v1.8.1.

2.7k stars · JavaScript · LLMOps · Eval · Inference · Serving
Velocity (7d): +0.8 ★/day · Trend: steady

What it does

OpenPAI is a Kubernetes-based platform for sharing AI compute resources — GPUs, FPGAs, InfiniBand — across teams. It wraps job scheduling, user management, storage, and pre-built Docker images for TensorFlow, PyTorch, and friends into a single deployable stack. Administrators manage nodes through a web portal and a paictl CLI; users submit training jobs without worrying about the hardware underneath.

The interesting bit

The project advertises Microsoft’s “proven track record in large-scale production environment”, a rare claim of battle-tested lineage in open-source cluster tooling. It also shed its Hadoop YARN roots in v1.0, moving fully to Kubernetes with its custom HiveD scheduler for GPU-aware placement.
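To make "GPU-aware placement" a little more concrete, here is a minimal sketch under stated assumptions. HiveD's real algorithm is more involved (it reasons about the cluster's hardware topology); this swaps in a plain best-fit heuristic, and the `Node` and `place_job` names are hypothetical, not part of OpenPAI. The idea it illustrates: keep all of a job's GPUs on one node, and pick the tightest node that fits so large free blocks stay available for large jobs.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    # Hypothetical node model: a name plus its currently free GPU count.
    name: str
    free_gpus: int


def place_job(nodes: Dict[str, Node], gpus_needed: int) -> Optional[str]:
    """Best-fit placement that keeps all of a job's GPUs on one node.

    Chooses the node with the fewest free GPUs that can still host the
    job, so larger free blocks stay intact for bigger jobs. Returns the
    chosen node name, or None if no single node fits.
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")
    candidates = [n for n in nodes.values() if n.free_gpus >= gpus_needed]
    if not candidates:
        return None  # refuse to scatter the job across nodes
    # Tie-break on name so the choice is deterministic.
    best = min(candidates, key=lambda n: (n.free_gpus, n.name))
    best.free_gpus -= gpus_needed
    return best.name


if __name__ == "__main__":
    cluster = {
        "node-a": Node("node-a", 8),
        "node-b": Node("node-b", 2),
        "node-c": Node("node-c", 4),
    }
    print(place_job(cluster, 2))  # node-b: exact fit, node-a stays whole
    print(place_job(cluster, 4))  # node-c
    print(place_job(cluster, 8))  # node-a
    print(place_job(cluster, 1))  # None: every node is now full
```

Rejecting a job rather than splitting it is the design choice worth noticing: a distributed training job spread across nodes pays for slower interconnect, so a GPU-aware scheduler would rather wait for a contiguous slot.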

Key highlights

  • Supports on-premises, hybrid, cloud, or single-box deployment
  • Modular architecture: marketplace, VS Code extension, SDK, and runtime are separate repos you can swap in or out
  • Pre-built containers for popular frameworks; distributed training ready
  • Virtual clusters for multi-tenant resource isolation
  • End-to-end manuals for both administrators and end users
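Virtual-cluster isolation ultimately comes down to per-team accounting. The sketch below is a hypothetical, deliberately simplified model: the `VirtualCluster` and `QuotaAdmission` names are invented for illustration and are not OpenPAI's API. Each team gets a fixed GPU quota, and a job is admitted only if it fits inside its own team's share, so one team's backlog cannot consume another team's GPUs. Any priority, preemption, or borrowing rules the real scheduler applies are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VirtualCluster:
    # Hypothetical model: a team's slice of the cluster as a fixed GPU quota.
    name: str
    gpu_quota: int
    used_gpus: int = 0
    running: Dict[str, int] = field(default_factory=dict)  # job id -> GPUs held


class QuotaAdmission:
    """Admit a job only if its own virtual cluster has quota left."""

    def __init__(self, quotas: Dict[str, int]) -> None:
        self.vcs = {name: VirtualCluster(name, q) for name, q in quotas.items()}

    def submit(self, vc_name: str, job_id: str, gpus: int) -> bool:
        vc = self.vcs[vc_name]  # raises KeyError for an unknown virtual cluster
        if job_id in vc.running:
            raise ValueError(f"job {job_id!r} is already running in {vc_name!r}")
        if vc.used_gpus + gpus > vc.gpu_quota:
            return False  # over this team's share; other VCs are untouched
        vc.used_gpus += gpus
        vc.running[job_id] = gpus
        return True

    def finish(self, vc_name: str, job_id: str) -> None:
        vc = self.vcs[vc_name]
        vc.used_gpus -= vc.running.pop(job_id)  # release the job's GPUs


if __name__ == "__main__":
    adm = QuotaAdmission({"research": 8, "product": 4})
    print(adm.submit("research", "job-1", 6))  # True
    print(adm.submit("research", "job-2", 4))  # False: 6 + 4 exceeds 8
    print(adm.submit("product", "job-3", 4))   # True: product has its own quota
    adm.finish("research", "job-1")
    print(adm.submit("research", "job-2", 4))  # True once job-1 releases its GPUs
```

Note that the "research" rejection does not depend on how busy "product" is; that independence is what multi-tenant isolation buys.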

Caveats

  • The repository has been read-only since v1.8.1 (December 2021); no major features are planned, and collaboration requires contacting the repo admins directly
  • The README’s upgrade table references v1.0.0 as “latest”, but the banner names v1.8.1 as the actual final release, so documentation drift is visible

Verdict

Worth studying if you’re building an internal GPU-sharing platform and want to see how Microsoft solved user quotas, job orchestration, and framework containerization. Not worth adopting fresh unless you plan to fork and maintain it yourself.
