A transformer you can actually read in one sitting
A minimal PyTorch transformer implementation that prioritizes clarity over scale, now archived and moved to Codeberg.

What it does
This is a from-scratch transformer in PyTorch, stripped down to the essentials. No abstractions hiding the attention mechanism, no distributed training boilerplate, no 10,000-line files. Just the core architecture: embeddings, multi-head self-attention, feed-forward layers, and positional encoding, wired together plainly enough to trace with a cup of coffee.
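To give a sense of what "plainly enough to trace" can look like, here is a minimal sketch of multi-head self-attention in plain PyTorch. It is not taken from the repository: the class name `SelfAttention`, the layer names, and the sizes are all assumptions chosen for illustration.

```python
import math

import torch
from torch import nn


class SelfAttention(nn.Module):
    """Illustrative multi-head self-attention (hypothetical, not the repo's code)."""

    def __init__(self, emb: int, heads: int = 4):
        super().__init__()
        assert emb % heads == 0, "embedding size must divide evenly across heads"
        self.heads = heads
        # One linear map each for queries, keys, and values.
        self.to_q = nn.Linear(emb, emb, bias=False)
        self.to_k = nn.Linear(emb, emb, bias=False)
        self.to_v = nn.Linear(emb, emb, bias=False)
        # Mixes the concatenated head outputs back into one vector per token.
        self.unify = nn.Linear(emb, emb)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, e = x.shape
        h, s = self.heads, e // self.heads

        def split(z: torch.Tensor) -> torch.Tensor:
            # (batch, time, emb) -> (batch, heads, time, head_size)
            return z.view(b, t, h, s).transpose(1, 2)

        q, k, v = split(self.to_q(x)), split(self.to_k(x)), split(self.to_v(x))

        # Scaled dot-product scores between every pair of positions.
        scores = q @ k.transpose(-2, -1) / math.sqrt(s)  # (b, h, t, t)
        weights = scores.softmax(dim=-1)                 # each row sums to 1

        out = weights @ v                                # (b, h, t, s)
        out = out.transpose(1, 2).contiguous().view(b, t, e)
        return self.unify(out)


torch.manual_seed(0)
attn = SelfAttention(emb=16, heads=4)
x = torch.randn(2, 5, 16)  # 2 sequences, 5 tokens each, 16-dim embeddings
y = attn(x)                # same shape as the input
```

The sketch leaves out masking and dropout to stay short; a version meant for autoregressive training would need a causal mask applied to `scores` before the softmax.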
The interesting bit
Most educational transformer code either drowns you in framework magic or leaves out the tricky parts. This one sits in a narrow middle ground: complete enough to train, small enough to fit in your head. The archival notice suggests the author kept iterating elsewhere, but the GitHub snapshot remains a readable fossil.
Key highlights
- Pure PyTorch, no external transformer libraries
- Self-contained implementation of attention, layer norm, residual connections
- Explicit enough to modify for experiments or pedagogy
- 1,098 stars suggest it found its audience
- Current maintenance lives at codeberg.org/pbm/former
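The attention, layer norm, and residual connections listed above typically combine into a single repeating block. The sketch below is one common arrangement (post-norm, ReLU feed-forward), assumed here for illustration rather than copied from the repository. It uses PyTorch's built-in `nn.MultiheadAttention` to stay short; the name `Block` and the parameter `ff_mult` are hypothetical.

```python
import torch
from torch import nn


class Block(nn.Module):
    """Illustrative transformer block: attention and feed-forward,
    each wrapped in a residual connection followed by layer norm."""

    def __init__(self, emb: int, heads: int = 4, ff_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(emb)
        self.norm2 = nn.LayerNorm(emb)
        # Position-wise feed-forward network with a wider hidden layer.
        self.ff = nn.Sequential(
            nn.Linear(emb, ff_mult * emb),
            nn.ReLU(),
            nn.Linear(ff_mult * emb, emb),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: queries, keys, and values all come from x.
        attended, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attended)         # residual, then layer norm
        return self.norm2(x + self.ff(x))    # residual, then layer norm


torch.manual_seed(0)
block = Block(emb=16)
tokens = torch.randn(2, 5, 16)  # (batch, time, emb)
out = block(tokens)             # shape is preserved, so blocks can be stacked
```

Because the block maps a `(batch, time, emb)` tensor to one of the same shape, experiments such as swapping post-norm for pre-norm or changing the activation only touch a few lines of `forward`.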
Caveats
- The GitHub repository is explicitly unmaintained; the latest version lives on Codeberg
- README is a one-line redirect, so details on training scripts, datasets, or performance are absent from this snapshot
Verdict
Grab this if you’re teaching transformers, debugging your own implementation, or just want to see the algorithm without the infrastructure. Skip it if you need production-scale training, current bug fixes, or documentation beyond the code itself.