higgsfield-ai/higgsfield
Fault-tolerant GPU cluster orchestrator and ML training framework for trillion-parameter LLMs.

Higgsfield is an open-source GPU workload manager and machine learning framework for distributed training of very large neural networks. It allocates compute resources, manages node access for training tasks, and supports DeepSpeed ZeRO-3 and PyTorch's Fully Sharded Data Parallel (FSDP) API for sharding trillion-parameter models across GPUs. The framework also includes job queuing to handle resource contention, and GitHub Actions integration for continuous ML development.
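
To give a rough sense of why full sharding matters at this scale, here is a minimal back-of-the-envelope sketch in Python. It is not part of Higgsfield's API: the function name `model_state_bytes_per_gpu` is hypothetical, and the 16 bytes per parameter figure assumes mixed-precision Adam training (fp16 parameters and gradients, fp32 optimizer states), the accounting commonly used when describing ZeRO. Activations, buffers, and memory fragmentation are ignored.

```python
# Illustrative arithmetic only; names here are hypothetical, not Higgsfield APIs.
# Assumption: mixed-precision Adam, so each parameter carries
#   2 bytes (fp16 weight) + 2 bytes (fp16 gradient)
#   + 12 bytes (fp32 master weight, Adam momentum, Adam variance).
BYTES_PER_PARAM = {
    "fp16_params": 2,
    "fp16_grads": 2,
    "fp32_optimizer_states": 12,
}


def model_state_bytes_per_gpu(num_params: int, num_gpus: int) -> float:
    """Model-state bytes per GPU when parameters, gradients, and
    optimizer states are all sharded evenly (ZeRO-3 / FSDP full sharding)."""
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / num_gpus


if __name__ == "__main__":
    one_trillion = 10**12
    for gpus in (1, 64, 1024):
        gb = model_state_bytes_per_gpu(one_trillion, gpus) / 1e9
        print(f"{gpus:>5} GPUs: {gb:,.1f} GB of model state per GPU")
```

Under these assumptions, a trillion-parameter model needs about 16 TB of model state in total, which no single GPU can hold; only when that state is divided across on the order of a thousand GPUs does the per-device share approach the memory of one accelerator.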