huggingface/llm_training_handbook
A technical handbook with scripts and commands for successfully training large language models, covering parallelism, throughput optimization, and training instabilities.

This repository provides practical methods for engineers and operators who run distributed LLM training across GPUs. It covers model parallelism strategies, techniques for maximizing throughput, tensor precision considerations, and approaches to debugging both software and hardware failures. Content is organized into categories, including hyperparameters, SLURM job scheduling, and resource management, and pairs working code examples with conceptual explanations.