
huggingface/llm_training_handbook

A technical handbook with scripts and commands for successfully training large language models, covering parallelism, throughput optimization, and training instabilities.

Star velocity (7d): +0.5 ★ / day · Trend: steady

This repository provides practical methodologies for LLM training engineers and operators working with distributed training across GPUs. It covers model parallelism strategies, throughput maximization techniques, tensor precision considerations, and debugging approaches for both software and hardware failures. Content is organized into categories for hyperparameters, SLURM job scheduling, and resource management, with working code examples alongside conceptual explanations.
