karpathy/llm.c

Andrej Karpathy drops PyTorch, trains GPT-2 in 1,000 lines of C

A from-scratch LLM trainer that ditches 245MB of PyTorch dependencies for raw C/CUDA, and somehow runs slightly faster.

★30.1k stars Cuda Language Models ML Frameworks Inference · Serving

Velocity (7d): +38 ★ / day · Trend: → steady

What it does

llm.c trains GPT-2 (and eventually GPT-3) models using nothing but C and CUDA—no PyTorch, no Python runtime. The core CPU reference implementation fits in a single ~1,000-line file, train_gpt2.c, while the production path lives in train_gpt2.cu. A parallel PyTorch implementation in train_gpt2.py exists strictly for verification and comparison.

The interesting bit

The project treats “educational” and “fast” as non-conflicting goals. The dev/cuda directory collects hand-written, documented kernels ranging from naive to optimized, while the mainline freely swaps in vendor libraries (cuBLAS, cuDNN, NCCL) when raw speed matters. It’s a living benchmark: your custom kernel is measured against the expert upper bound, not against vague intuition.
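The naive-to-optimized progression can be sketched on the CPU in plain C. The two functions below are illustrative only and are not taken from the repository: matmul_naive and matmul_row_tiled are hypothetical names, and the row-tiling trick is just one simple example of an optimization that must reproduce the reference numbers before its speed is worth measuring.

```c
#include <assert.h>

/* Naive reference: out[r][o] = bias[o] + sum over i of inp[r][i] * weight[o][i].
 * Row-major shapes: inp (rows, in_dim), weight (out_dim, in_dim), out (rows, out_dim).
 * bias may be a null pointer, in which case no bias is added. */
void matmul_naive(float *out, const float *inp, const float *weight,
                  const float *bias, int rows, int in_dim, int out_dim) {
    for (int r = 0; r < rows; r++) {
        for (int o = 0; o < out_dim; o++) {
            float acc = bias ? bias[o] : 0.0f;
            for (int i = 0; i < in_dim; i++) {
                acc += inp[r * in_dim + i] * weight[o * in_dim + i];
            }
            out[r * out_dim + o] = acc;
        }
    }
}

/* Hypothetical "optimized" candidate: handle ROW_TILE rows at a time so that
 * each weight value is loaded once and reused for ROW_TILE accumulators. */
enum { ROW_TILE = 4 };

void matmul_row_tiled(float *out, const float *inp, const float *weight,
                      const float *bias, int rows, int in_dim, int out_dim) {
    /* Assumption of this sketch: rows is a multiple of the tile size. */
    assert(rows % ROW_TILE == 0);
    for (int r0 = 0; r0 < rows; r0 += ROW_TILE) {
        for (int o = 0; o < out_dim; o++) {
            float acc[ROW_TILE];
            for (int t = 0; t < ROW_TILE; t++) {
                acc[t] = bias ? bias[o] : 0.0f;
            }
            for (int i = 0; i < in_dim; i++) {
                float w = weight[o * in_dim + i];  /* loaded once, used ROW_TILE times */
                for (int t = 0; t < ROW_TILE; t++) {
                    acc[t] += inp[(r0 + t) * in_dim + i] * w;
                }
            }
            for (int t = 0; t < ROW_TILE; t++) {
                out[(r0 + t) * out_dim + o] = acc[t];
            }
        }
    }
}
```

Running both functions on the same inputs and comparing the outputs element by element is the correctness gate; only after that does timing the candidate against the vendor-library version say anything useful.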

Key highlights

Currently ~7% faster than PyTorch Nightly on the mainline CUDA path
Single-file CPU reference (train_gpt2.c) for actually understanding the algorithm
Multi-GPU and multi-node training via MPI/NCCL, with three different initialization strategies for stubborn cluster environments
Unit tests that check C and CUDA outputs against the PyTorch reference (a passing run ends with overall okay: 1)
Flash Attention via cuDNN available, though it balloons compile time from seconds to ~a minute
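The reference-matching tests in the highlights come down to comparing two float buffers element by element. A minimal sketch of such a check follows, assuming a caller-chosen absolute tolerance. tensors_match is a hypothetical helper written for this illustration, not the repository's test code, and the tolerances the project actually uses are not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every element of `actual` is within `tol` (absolute) of the
 * corresponding element of `expected`, otherwise 0. Prints a one-line report. */
int tensors_match(const float *actual, const float *expected, size_t n,
                  float tol, const char *label) {
    assert(tol >= 0.0f);
    int ok = 1;
    float max_diff = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = actual[i] - expected[i];
        if (d < 0.0f) d = -d;
        if (d > max_diff) max_diff = d;
        if (!(d <= tol)) ok = 0;  /* written this way so a NaN difference also fails */
    }
    printf("%s: %s (max abs diff %e)\n", label, ok ? "OK" : "NOT OK", max_diff);
    return ok;
}
```

A test driver in this style would combine the per-tensor results with a logical AND and print a single summary flag at the end, which is the shape of the overall okay: 1 line quoted above.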

Caveats

The CPU path is explicitly a “you won’t get far” demo; on an Apple Silicon M3 Max, training takes ~1.3 seconds per step even for the small 124M-parameter model
cuDNN integration is new enough that it’s disabled by default
The README notes the tension between simplicity and speed: a 2% performance gain that costs 500 lines of complexity may be rejected

Verdict

Worth your time if you’re learning CUDA, skeptical of framework bloat, or want to see how close to the metal LLM training can get. Skip it if you need production features like checkpoint resumption, mixed-precision convenience, or anything resembling HuggingFace integration.