karpathy/nanochat

GPT-2 for the price of a nice dinner

Karpathy's minimal LLM training harness turns a $43K 2019 training run into a sub-$100 afternoon project.

Star velocity (7d): +230 ★ / day · Trend: steady

What it does

nanochat is a stripped-down framework for training language models from scratch on a single GPU node, covering the full lifecycle: tokenization, pretraining, finetuning, evaluation, inference, and a ChatGPT-style web UI. The pitch is simple enough to fit in a shell script: run bash runs/speedrun.sh on an 8×H100 node, wait roughly two hours, and you’ve got a conversational model at GPT-2 capability.

The interesting bit

The entire hyperparameter zoo (width, heads, learning rate, training horizon, weight decay) collapses into one knob: --depth, the transformer layer count. Set the depth and the framework derives everything else to stay compute-optimal. It’s a bet that scaling laws have matured enough to make hand-tuning obsolete, and the “GPT-2 speedrun” leaderboard (now down to 1.65 hours) treats training time as a competitive sport.
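One way to picture the single-knob idea is a function that maps depth to everything else. The sketch below is an illustration only: the constants (ASPECT_RATIO, HEAD_DIM, TOKENS_PER_PARAM), the DerivedConfig type, and config_from_depth are hypothetical names and values, not nanochat's actual code, and learning rate and weight decay are left out for brevity.

```python
from dataclasses import dataclass

# All constants are illustrative assumptions, not values read from nanochat.
ASPECT_RATIO = 64       # assumed: model width added per layer
HEAD_DIM = 128          # assumed: dimension of each attention head
TOKENS_PER_PARAM = 20   # assumed: Chinchilla-style tokens-per-parameter ratio


@dataclass(frozen=True)
class DerivedConfig:
    depth: int
    width: int
    num_heads: int
    approx_params: int
    train_tokens: int


def config_from_depth(depth: int) -> DerivedConfig:
    """Derive a full (hypothetical) model config from the layer count alone."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    # Scale width with depth, rounded up to a whole number of heads.
    num_heads = -(-(depth * ASPECT_RATIO) // HEAD_DIM)  # ceiling division
    width = num_heads * HEAD_DIM
    # Standard estimate: ~12 * width^2 parameters per block
    # (4 * width^2 for attention, 8 * width^2 for a 4x MLP), embeddings ignored.
    approx_params = 12 * depth * width * width
    # Training horizon follows from model size.
    train_tokens = TOKENS_PER_PARAM * approx_params
    return DerivedConfig(depth, width, num_heads, approx_params, train_tokens)


if __name__ == "__main__":
    for d in (4, 12, 20):
        print(config_from_depth(d))
```

The point of the design is that the caller never picks width, heads, or token budget directly, so every model in a depth sweep sits on the same assumed scaling rule.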

Key highlights

  • Matches GPT-2 (1.6B params) on the DCLM CORE score for ~$48 on-demand or ~$15 on spot instances; the original 2019 training run cost ~$43,000
  • Explicit mixed precision via a global COMPUTE_DTYPE instead of PyTorch’s autocast; weights stay fp32, and forward passes cast on the fly
  • Single-file speedrun pipeline (runs/speedrun.sh) plus research scripts for scaling-law sweeps and miniseries generation
  • Runs on a single GPU (about 8× slower), on A100s, and on CPU/MPS; cards with less than 80GB of memory need batch-size tuning to avoid OOM
  • Chat web UI included; model personality customizable through synthetic data injection in the SFT stage
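The mixed-precision highlight (fp32 weights, casts on the fly) can be motivated with the standard library alone. The sketch below is not nanochat's code, which is PyTorch; COMPUTE_CAST, forward, and sgd_step are hypothetical stand-ins, and fp16 is chosen only because Python's struct module can round to it. It shows a small update that survives in an fp32 master weight but would round away if the weight itself were stored in fp16.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical stand-in for a global compute dtype: one cast used everywhere.
COMPUTE_CAST = to_fp16


def forward(weights, inputs):
    """Dot product with weights and inputs cast to compute precision on the fly."""
    return sum(COMPUTE_CAST(w) * COMPUTE_CAST(x) for w, x in zip(weights, inputs))


def sgd_step(weights, grads, lr):
    """Apply the update to the fp32 master weights; no fp16 copy is stored."""
    return [to_fp32(w - lr * g) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    master = [1.0]
    updated = sgd_step(master, grads=[1.0], lr=1e-4)
    print("fp32 master after update:", updated[0])
    print("same value if stored as fp16:", to_fp16(updated[0]))
    print("forward pass with updated weight:", forward(updated, [1.0]))
```

Repeated small updates accumulate in the fp32 copy until they become visible at compute precision, which is the usual argument for keeping master weights in full precision.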

Caveats

  • CPU/MPS runs are “you will not get strong results” territory per the README; this is firmly GPU-first
  • RL training doesn’t yet support fp16 GradScaler, unlike pretraining and SFT
  • Non-CUDA paths (xpu, etc.) are “fairly vanilla PyTorch” but largely untested by the author

Verdict

Ideal for researchers who want hackable, end-to-end LLM training without framework bloat, or anyone who finds pedagogical value in watching loss curves on their own hardware. If you just need an API key to call GPT-4, this is not your shortcut.
