KellerJordan/modded-nanogpt
NanoGPT (124M) training speedrun that reaches the target cross-entropy loss on FineWeb in under 90 seconds on 8 NVIDIA H100 GPUs.

This repository hosts a collaborative speedrun: train a 124M-parameter NanoGPT model to a cross-entropy loss of 3.28 on the FineWeb validation set as quickly as possible. The project builds on Karpathy’s llm.c GPT-2 replication and adds modern training techniques, including the Muon optimizer, FP8 matmul with asymmetric rescaling, and Flash Attention 3 with long-short sliding window patterns, along with architectural enhancements such as rotary embeddings and skip connections. The goal is to benchmark training efficiency and algorithmic optimizations for language model training.
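
To make the target metric concrete, here is a minimal sketch of how a mean per-token cross-entropy loss could be computed and compared against the 3.28 target. It is not the repository's evaluation code: the function names (`cross_entropy`, `mean_val_loss`, `reached_target`) are hypothetical, the loss is assumed to be measured in nats, and the "at or below the target" rule is an assumption made for illustration.

```python
import math

TARGET_VAL_LOSS = 3.28  # target validation loss stated above


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), in nats."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mean_val_loss(batch_logits, targets):
    """Average per-token cross-entropy over a list of positions."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


def reached_target(val_loss, target=TARGET_VAL_LOSS):
    """Assumed rule: the run counts once the loss is at or below the target."""
    return val_loss <= target


if __name__ == "__main__":
    # Toy example: uniform logits over 4 tokens give a loss of ln(4).
    loss = mean_val_loss([[0.0, 0.0, 0.0, 0.0]], [2])
    print(f"toy loss: {loss:.4f}, reached target: {reached_target(loss)}")
```

For intuition, a loss of 3.28 nats corresponds to a perplexity of exp(3.28), roughly 26.6, meaning the model is on average about as uncertain as a uniform choice among 26 to 27 tokens.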