A scrappy GPT-2 reimplementation that admits it can't quite match OpenAI
Independent training code for GPT-2 with TPU support, plus the rare honesty that the results fall short of the original.

What it does
This is a from-scratch TensorFlow implementation for training GPT-2 on GPUs or Google TPUs. It includes scripts to wrangle the OpenWebText corpus (web pages linked from Reddit), encode it as TFRecords, and train models from 117M parameters up to 1.5B. The author also released their own pretrained checkpoints, though they label them “inferior” to OpenAI’s.
The interesting bit
The README opens with a disclaimer that the author couldn’t replicate the original model’s full performance and has no idea why — a refreshing break from the usual benchmark inflation. The whole thing is built around JSON config files rather than argparse soup, and it includes a handwritten data pipeline that stitches short texts together so you never waste context window on padding.
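The idea of stitching short texts together can be illustrated with a minimal sketch. This is not the repository's actual code: the function name `pack_documents`, its parameters, and the choice to drop the trailing partial window are assumptions made for illustration. The only fixed fact used is that 50256 is the `<|endoftext|>` token id in the standard GPT-2 vocabulary.

```python
from typing import Iterable, Iterator, List

# Token id of <|endoftext|> in the standard GPT-2 vocabulary.
EOT = 50256


def pack_documents(
    docs: Iterable[List[int]],
    context_len: int,
    sep: int = EOT,
) -> Iterator[List[int]]:
    """Yield windows of exactly `context_len` tokens with no padding.

    Hypothetical helper: tokenized documents are concatenated, each
    followed by a separator token, and the running buffer is cut into
    full windows as soon as enough tokens are available.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(sep)
        # Emit every full window currently in the buffer.
        while len(buffer) >= context_len:
            yield buffer[:context_len]
            buffer = buffer[context_len:]
    # Assumption: a leftover shorter than `context_len` is discarded
    # rather than padded.


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for window in pack_documents(toy_docs, context_len=4, sep=0):
        print(window)
```

With the toy input, a document can straddle two windows, which is the trade-off of packing: every position holds a real token, but a window may start mid-document.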
Key highlights
- Supports both single GPUs and TPU pods (v2-256, v3-512) without code changes
- Released pretrained models: 117M, “PrettyBig” (slightly over 345M), and 1.5B
- Custom data pipeline requires modifying inputs.py by hand, with no slick abstraction
- Dataset generation is documented but hacky; author spent ~€500 on cloud compute to process it
- Prediction only works on GPU/CPU, not TPUs
Caveats
- The author explicitly states that performance does not match the original GPT-2 and that the cause has not been found
- Evaluation breaks on TPU pods and must be commented out
- Dataset scripts are “a bit hacky” and need manual adaptation
Verdict
Worth a look if you need a hackable GPT-2 training codebase with TPU support from before the Hugging Face Transformers era, and don’t mind some assembly required. Skip it if you want battle-tested, drop-in reproductions or modern PyTorch ergonomics.