stas00/ml-engineering
An open technical handbook providing methodologies, tools, and step-by-step instructions for training and running large language models and vision-language models.

This repository documents practical ML engineering knowledge accumulated while training large models, including BLOOM-176B and IDEFICS-80B. It covers hardware selection (GPUs, storage, networking), orchestration with systems like SLURM, model training optimization, and inference deployment. The content is organized as technical guides with accompanying scripts and commands, and is aimed at LLM/VLM training engineers and ML operators.