Andrej Karpathy's nanoGPT is now a museum piece
A deliberately minimal GPT-2 implementation that taught a generation how transformers work, now officially succeeded by nanochat.

What it does
nanoGPT is a stripped-down PyTorch implementation for training and fine-tuning GPT-2-scale language models. The entire codebase fits in two ~300-line files: train.py for the training loop and model.py for the transformer itself. It can reproduce GPT-2 (124M parameters) on OpenWebText in about four days on an 8×A100 node, or train a toy Shakespeare model on your laptop in three minutes.
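The 124M figure can be checked with some quick arithmetic. The sketch below assumes the standard published GPT-2 small hyperparameters (12 layers, 768-wide embeddings, a 50,257-token vocabulary, 1,024 positions), bias terms on every linear layer and LayerNorm, and an output head that shares weights with the token embedding. The helper names are made up for this illustration and are not taken from the repo.

```python
# Back-of-the-envelope parameter count for GPT-2 small.
# Assumptions: standard published hyperparameters, biases everywhere,
# and an output head tied to the token embedding (so it adds no parameters).
n_layer, n_embd = 12, 768
vocab_size, block_size = 50257, 1024


def layernorm(d):
    """Parameters in a LayerNorm of width d: weight + bias."""
    return 2 * d


def linear(d_in, d_out):
    """Parameters in a dense layer: weight matrix + bias vector."""
    return d_in * d_out + d_out


per_block = (
    layernorm(n_embd)
    + linear(n_embd, 3 * n_embd)   # fused query/key/value projection
    + linear(n_embd, n_embd)       # attention output projection
    + layernorm(n_embd)
    + linear(n_embd, 4 * n_embd)   # MLP expansion
    + linear(4 * n_embd, n_embd)   # MLP projection back to model width
)

token_emb = vocab_size * n_embd
pos_emb = block_size * n_embd
final_norm = layernorm(n_embd)
total = token_emb + pos_emb + n_layer * per_block + final_norm

print(f"{total:,}")  # 124,439,808
```

About 85M of those parameters sit in the twelve transformer blocks; most of the remainder is the token embedding table.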
The interesting bit
The README opens with a deprecation notice: Karpathy now points visitors to nanochat, leaving this repo up “for posterity.” That honesty is refreshing in a field where old repos usually just rot silently. When it was current, the project’s real trick was refusing to be clever—no abstractions, no framework, just raw PyTorch you could actually read and mutate.
Key highlights
- Loads OpenAI’s GPT-2 weights directly for fine-tuning or initialization
- Supports distributed training across multiple GPU nodes via torchrun
- Includes pre-built configs for CPU, single GPU, and multi-node A100 setups
- sample.py handles inference from trained checkpoints or OpenAI’s released models
- Apple Silicon supported via --device=mps for 2–3× speedup over CPU
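Flags such as --device=mps follow a plain key=value override style layered on top of a config. As a rough illustration of that pattern, and not the repo's actual code, the sketch below applies command-line overrides to a dictionary of defaults. The function name apply_overrides and the default values are hypothetical.

```python
from ast import literal_eval


def apply_overrides(config, argv):
    """Return a copy of config with --key=value overrides applied."""
    merged = dict(config)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in merged:
            raise KeyError(f"unknown config key: {key}")
        try:
            # Parses numbers, booleans, and quoted strings.
            value = literal_eval(raw)
        except (ValueError, SyntaxError):
            # Bare words such as mps are kept as plain strings.
            value = raw
        merged[key] = value
    return merged


# Hypothetical defaults, for illustration only.
defaults = {"device": "cuda", "batch_size": 12, "compile": True}
cfg = apply_overrides(defaults, ["--device=mps", "--compile=False"])
print(cfg)  # {'device': 'mps', 'batch_size': 12, 'compile': False}
```

Rejecting unknown keys is a deliberate choice in this sketch: a mistyped flag fails loudly instead of silently training with the default.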
Caveats
- Explicitly deprecated as of November 2025; new work should use nanochat instead
- Multi-node training without Infiniband “will most likely crawl,” per the README
- Character-level Shakespeare demo is fun but produces “lol ¯\_(ツ)_/¯” quality output
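For context on the last caveat, "character-level" means the model sees one integer per character instead of subword tokens. Here is a minimal sketch of that kind of tokenizer, assuming a vocabulary built from whatever characters appear in the text. The sample string and the names stoi, itos, encode, and decode are illustrative, not a copy of the repo's data preparation script.

```python
# Minimal character-level tokenizer: every distinct character gets an integer id.
text = "First Citizen: Before we proceed any further, hear me speak."  # sample line

chars = sorted(set(text))                      # the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> character


def encode(s):
    """Map a string to a list of integer ids (raises KeyError on unseen characters)."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hear me")
print(decode(ids))  # hear me
```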
Verdict
Worth studying if you want to understand how a modern transformer trainer is structured without drowning in framework indirection. Skip it if you’re building something new—follow the author’s own advice and use nanochat.