A shooting gallery for attention mechanisms that claim to be efficient
Google Research built a standardized proving ground so you can stop arguing about whether your O(n log n) attention actually works.

What it does
Long Range Arena (LRA) is a benchmark suite that pits efficient Transformer variants against each other on six tasks—ranging from hierarchical ListOps to pixel-level image classification—while measuring accuracy, speed, and memory. It ships with JAX/Flax implementations of vanilla Transformers and a leaderboard of published results.
The interesting bit
The benchmark enforces an “apples-to-apples” mode: if you want on the official leaderboard, your model can’t exceed ~10% more parameters than the base Transformer, and you can’t touch embedding sizes or layer counts. It’s a deliberate attempt to stop the “we beat BERT on one task with 3× the compute” paper mill.
Key highlights
- Six diverse tasks: ListOps, text classification, document retrieval, pixel-level images, Pathfinder, and Path-X (the last one breaks most models—every listed variant except external entries “FAIL” on it as of Nov 2020)
- Includes implementations of Linformer, Performer, Reformer, BigBird, Longformer, and others in JAX/Flax
- Two comparison modes: strict “apples-to-apples” for leaderboard inclusion, or “free-for-all” if you just want numbers for your paper
- External submissions accepted but segregated; the authors verify all official results internally
- Datasets available via Google Cloud Storage; some (like document retrieval) require manual assembly from original sources due to redistribution limits
Caveats
- The V2 release added “xformer models” but the README still contains struck-through text about a planned update, suggesting the project may be only partially maintained
- Path-X remains unsolved by all core benchmarked models; it’s unclear if this is a meaningful hardness signal or a dataset artifact
- Document retrieval setup is awkward—you need to download raw AAN data yourself and match IDs
Verdict
Worth your time if you’re publishing (or evaluating) a new efficient attention mechanism and need credible, standardized comparisons. Skip it if you just want plug-and-play fast attention for a product—the code is research infrastructure, not a library.