One attention head, one GPU, 24 hours: near-Transformer results
A research project showing you don't need multi-head attention or TPU pods to get competitive language modeling.

What it does
SHA-RNN bolts a single attention mechanism onto a four-layer LSTM and trains on byte-level text (enwik8). The goal is to match Transformer-class results without the Transformer-class hardware bill or training fragility. It hits ~1.07 BPC on enwik8 in under a day on a single 12GB Titan V.
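For readers unfamiliar with the metric, bits per character (BPC) is the average per-byte cross-entropy expressed in base 2. Below is a minimal sketch of the conversion, assuming the training loop reports mean cross-entropy in nats (as PyTorch's cross-entropy loss does). The function name and the example loss value are illustrative and not taken from the project's code.

```python
import math

def nats_to_bpc(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-character cross-entropy in nats to bits per character."""
    return mean_cross_entropy_nats / math.log(2)

# Illustrative value only: a loss of about 0.742 nats per byte is about 1.07 BPC.
example_loss = 0.742
print(round(nats_to_bpc(example_loss), 2))  # prints 1.07
```

As a sanity check, a model that guesses uniformly over all 256 byte values scores exactly 8 BPC, so 1.07 BPC means the model needs a little over one bit per byte on average.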
The interesting bit
The paper’s title is a pun and the architecture is a provocation: one attention head, placed only in the second-to-last layer by default, is enough to capture 5,000-token dependencies. The model can also shed its attention state and fall back to a plain LSTM if memory gets tight — a graceful degradation path Transformers don’t offer.
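The idea of one attention read with a plain-LSTM fallback can be sketched in a few lines of dependency-free Python. This is generic scaled dot-product attention with a residual add, written for illustration; it is an assumption, not the project's actual attention code, and `single_head_attention` and `layer_output` are hypothetical names.

```python
import math

def single_head_attention(query, memory_keys, memory_values):
    """Scaled dot-product attention for one query vector over stored past states.

    Returns the attention-weighted sum of memory_values, or None when the
    memory is empty so the caller can use the recurrent state alone.
    """
    if not memory_keys:
        return None
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in memory_keys]
    # Numerically stable softmax over the memory positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory_values[0])
    return [sum(w * v[i] for w, v in zip(weights, memory_values)) for i in range(dim)]

def layer_output(hidden, memory_keys, memory_values):
    """Add the attention read to the recurrent state, or pass the state through."""
    context = single_head_attention(hidden, memory_keys, memory_values)
    if context is None:
        return list(hidden)  # fallback path: memory was dropped, behave like a plain LSTM layer
    return [h + c for h, c in zip(hidden, context)]

# With memory, the layer mixes in past states; without it, the state is unchanged.
print(layer_output([1.0, 0.0], [[0.5, 0.5]], [[3.0, 4.0]]))
print(layer_output([1.0, 0.0], [], []))
```

In this sketch the recurrent state does not depend on the attention memory to exist, which is why dropping the memory degrades quality rather than breaking the model.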
Key highlights
- Trains in ~30 minutes per epoch on a Titan V; full run under 24 hours
- Supports 5,000-token context without the compute/memory explosion of full self-attention
- Avoids Transformer training rituals: no long warmup, no fragile hyperparameter tuning
- Built from standard PyTorch parts (LSTM, linear layers) — ONNX-exportable, no custom kernels
- Within striking distance of Transformer-XL (1.07 vs. 1.06 BPC) with fewer parameters than the 18-layer variant
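To give a rough sense of why a single head keeps a 5,000-token context affordable, the sketch below counts query-key scores per training segment. The 12-layer, 8-head comparison model and the 1,024-token segment length are hypothetical numbers chosen for illustration, not figures from the paper, and the count ignores every other cost (projections, feed-forward layers, the LSTM itself).

```python
def attention_scores(attention_layers: int, heads: int, queries: int, memory: int) -> int:
    """Number of query-key dot products computed for one segment."""
    return attention_layers * heads * queries * memory

SEGMENT = 1024   # hypothetical tokens processed per training step
MEMORY = 5000    # context length quoted in the highlights above

single_head = attention_scores(attention_layers=1, heads=1, queries=SEGMENT, memory=MEMORY)
multi_head = attention_scores(attention_layers=12, heads=8, queries=SEGMENT, memory=MEMORY)

print(single_head)                # 5120000
print(multi_head)                 # 491520000
print(multi_head // single_head)  # 96
```

Under these assumptions the ratio is simply layers times heads, so the saving grows with the depth and width of the Transformer being compared against.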
Caveats
- The code is “not kind”: no CLI flags for model variants, manual edits to model.py required
- Author notes active bug-shaking; near-replication achieved but discrepancies remain
- Still ~0.09 BPC off true SOTA; framed explicitly as an efficiency play, not a bid for the accuracy crown
Verdict
Worth a look if you’re productionizing language models on commodity GPUs or skeptical that every problem needs a 175B-parameter Transformer. Skip if you need plug-and-play code or are chasing leaderboard-topping BPC at any cost.