
awslabs/awsome-distributed-ai

AWS reference architectures and test cases for distributed large model training using cloud HPC infrastructure.

Velocity (7d): +0.4 ★/day · Trend: steady

This repository provides reference architectures and benchmark scripts for training large models on AWS infrastructure, including SageMaker HyperPod, AWS ParallelCluster, AWS Parallel Computing Service (PCS), AWS Batch, and Amazon EKS. It includes CloudFormation templates, AMI and container build scripts, and test cases covering frameworks and parallelization strategies such as PyTorch DDP/FSDP, Megatron-LM, and NeMo. It also offers micro-benchmarks for NCCL, NCCOM, and NVSHMEM for measuring performance and troubleshooting distributed training clusters.
