Smerity/sha-rnn

One attention head, one GPU, 24 hours: near-Transformer results

A research project showing you don't need multi-head attention or TPU pods to get competitive language modeling.

1.2k stars · Python · Language Models

What it does

SHA-RNN bolts a single attention mechanism onto a four-layer LSTM and trains on byte-level text (enwik8). The goal is to match Transformer-class results without the Transformer-class hardware bill or training fragility. The reported result is ~1.07 bits per character (BPC) on enwik8 after less than a day of training on a single 12GB Titan V.
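
For readers new to the metric, here is a minimal sketch of how a BPC figure comes out of a byte-level model. On enwik8 a "character" is one byte, so BPC is the average negative log-likelihood of the correct next byte, expressed in bits. The function names and the sample probabilities are made up for illustration and do not come from the repository.

```python
import math


def average_nll(probabilities_of_true_bytes):
    """Mean negative log-likelihood, in nats, of the bytes that actually occurred."""
    total = -sum(math.log(p) for p in probabilities_of_true_bytes)
    return total / len(probabilities_of_true_bytes)


def bits_per_character(nll_nats_per_byte):
    """Convert a cross-entropy in nats per byte into bits per byte."""
    return nll_nats_per_byte / math.log(2)


# Hypothetical probabilities a model assigned to each correct next byte.
probs = [0.5, 0.25, 0.5, 0.5]
bpc = bits_per_character(average_nll(probs))
print(round(bpc, 4))  # 1.25
```

Most deep learning frameworks report cross-entropy in nats, so the division by ln 2 is the step that turns a training loss into a number comparable with the 1.07 quoted above.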

The interesting bit

The paper’s title is a pun and the architecture is a provocation: one attention head, placed only in the second-to-last layer by default, is presented as enough to reach back across a 5,000-token context. The model can also shed its attention state and fall back to a plain LSTM if memory gets tight, a graceful degradation path Transformers don’t offer.
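
To make the layout concrete, here is a rough PyTorch sketch of the idea: four LSTM layers, one attention head in the second-to-last layer, and an optional cache of past states that can be dropped. It is not the repository's model.py. The class names, sizes, and the `attend` switch are assumptions for illustration, LSTM hidden state is not carried between calls, and everything else in the real model is left out.

```python
import math

import torch
import torch.nn as nn


class SingleHeadLayer(nn.Module):
    """One LSTM layer, optionally followed by a single attention head."""

    def __init__(self, hidden_size, use_attn):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.use_attn = use_attn
        if use_attn:
            self.query = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, memory=None, attend=True):
        h, _ = self.lstm(x)
        if not (self.use_attn and attend):
            return self.norm(h), None  # plain LSTM path, no cache kept
        # Keys and values are the cached past states plus the current window.
        context = h if memory is None else torch.cat([memory, h], dim=1)
        t, m = h.size(1), context.size(1) - h.size(1)
        scores = self.query(h) @ context.transpose(1, 2) / math.sqrt(h.size(-1))
        # Causal mask: step i sees all cached states and current steps <= i.
        mask = torch.tril(torch.ones(t, t + m, device=h.device), diagonal=m).bool()
        scores = scores.masked_fill(~mask, float("-inf"))
        attended = torch.softmax(scores, dim=-1) @ context
        return self.norm(h + attended), context.detach()


class TinySHARNN(nn.Module):
    """Hypothetical miniature: attention only in the second-to-last layer."""

    def __init__(self, hidden_size=64, num_layers=4, max_memory=128):
        super().__init__()
        self.embed = nn.Embedding(256, hidden_size)  # one entry per byte value
        self.layers = nn.ModuleList(
            [SingleHeadLayer(hidden_size, use_attn=(i == num_layers - 2))
             for i in range(num_layers)]
        )
        self.head = nn.Linear(hidden_size, 256)
        self.max_memory = max_memory

    def forward(self, byte_ids, memory=None, attend=True):
        x = self.embed(byte_ids)
        new_memory = None
        for layer in self.layers:
            x, mem = layer(x, memory if layer.use_attn else None, attend)
            if mem is not None:
                new_memory = mem[:, -self.max_memory:]  # cap the cache length
        return self.head(x), new_memory


model = TinySHARNN()
batch = torch.randint(0, 256, (2, 16))
logits, memory = model(batch)                         # first window, empty cache
logits2, memory2 = model(batch, memory)               # second window reuses the cache
plain_logits, no_memory = model(batch, attend=False)  # fall back to LSTM only
```

The cache is the only part that grows with context length, which is why dropping it (or capping it with `max_memory`) is the cheap way out when GPU memory runs short.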

Key highlights

  • Trains in ~30 minutes per epoch on a Titan V; full run under 24 hours
  • Supports 5,000-token context without the compute/memory explosion of full self-attention
  • Avoids Transformer training rituals: no long warmup, no hyper-sensitive hyperparameter grid
  • Built from standard PyTorch parts (LSTM, linear layers): ONNX-exportable, no custom kernels
  • Within striking distance of Transformer-XL (1.07 vs. 1.06 BPC) with fewer parameters than the 18-layer variant
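
A back-of-the-envelope sketch of why one head in one layer is cheap: count the attention scores computed per input token. The 12-layer, 8-head comparison model below is a hypothetical stand-in, not a figure from the paper or the repository, and the count ignores everything except the score matrix.

```python
def scores_per_token_full(context_len, num_layers, num_heads):
    """Attention scores per token when every layer and head attends over the context."""
    return num_layers * num_heads * context_len


def scores_per_token_single_head(context_len):
    """Attention scores per token with one head in one layer."""
    return context_len


CONTEXT = 5000  # context length quoted in the highlights above
full = scores_per_token_full(CONTEXT, num_layers=12, num_heads=8)  # hypothetical sizes
single = scores_per_token_single_head(CONTEXT)
print(full, single, full // single)  # 480000 5000 96
```

Under these assumptions the gap is simply layers times heads, so the saving scales with how deep and wide the comparison Transformer is.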

Caveats

  • The code is “not kind”: no CLI flags for model variants, manual edits to model.py required
  • The author is still shaking out bugs; results are close to replicated, but discrepancies remain
  • Still ~0.09 BPC off true SOTA; framed explicitly as an efficiency play, not an accuracy crown

Verdict

Worth a look if you’re productionizing language models on commodity GPUs or skeptical that every problem needs a 175B-parameter Transformer. Skip if you need plug-and-play code or are chasing leaderboard-topping BPC at any cost.
