OpenAI's sparse attention kernels: still frozen in 2019
Fused CUDA kernels for attention patterns that skip most of the QK^T matrix, letting transformers stretch to longer sequences without melting your GPU.

What it does
This is OpenAI’s reference implementation of the sparse attention primitives from their 2019 “Sparse Transformers” paper. It provides fused CUDA kernels that compute attention while respecting block-sparsity patterns in the QK^T matrix — meaning you define which chunks of the attention matrix to actually calculate, and the rest get skipped entirely. The repo includes standard dense attention (with the upper triangle already elided), plus “strided” and “fixed” sparse patterns, and a small recompute decorator for memory management.
The interesting bit
The sparsity isn’t just a mask applied after the fact — it’s wired into the kernel at the block level. You specify a 0/1 pattern on a grid of blocks, and those blocks simply aren’t computed or included in softmax. There’s also a callback mechanism for finer-grained masking within computed blocks. It’s attention as stencil operation, not attention as brute-force matrix multiply.
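To make the block-level idea concrete, here is a minimal, unfused reference sketch in plain Python. It is not the repository's kernel or its API: the function name `block_sparse_attention_weights` and the list-of-lists layout format are assumptions made for illustration. It only shows what "skipped blocks are neither computed nor included in softmax" means for a single head.

```python
import math


def block_sparse_attention_weights(q, k, layout, block_size):
    """Reference attention weights under a 0/1 block layout (hypothetical helper).

    q, k: lists of n vectors (each a list of d floats).
    layout: (n / block_size) x (n / block_size) grid of 0/1 flags.
    Scores are only computed for (query, key) pairs whose block is enabled,
    and the softmax normaliser only sums over those pairs.
    """
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for i in range(n):
        # Key positions whose block is switched on for this query's block row.
        cols = [j for j in range(n) if layout[i // block_size][j // block_size]]
        if not cols:
            raise ValueError(f"query row {i} has no enabled blocks")
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in cols]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [0.0] * n  # disabled positions keep exactly zero weight
        for j, e in zip(cols, exps):
            row[j] = e / total
        weights.append(row)
    return weights


# Two 2x2 blocks per side; the top-right block is switched off.
layout = [[1, 0],
          [1, 1]]
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = q
w = block_sparse_attention_weights(q, k, layout, block_size=2)
```

A fused kernel would avoid ever materialising the disabled blocks. This sketch does the same thing logically by never visiting them in the inner loop, so the weights for keys 2 and 3 in the first two rows are exactly zero and each row still sums to one.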
Key highlights
- Fused kernels for QK^T with configurable block sparsity; block sizes of 8, 16, 32, 64 supported
- “Strided” and “fixed” attention patterns from the Sparse Transformers paper implemented natively
- Includes both a blocksparse path (Tensor Cores required for fp16 and for the smaller block sizes) and a fallback TensorFlow path
- Simple recompute=True decorator for gradient checkpointing-style memory savings
- Requires OpenAI’s separate blocksparse package, which needs CUDA 10 + tensorflow-gpu or manual source build
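For readers unfamiliar with the two named patterns, the sketch below builds element-level "strided" and "fixed" causal masks following the definitions in the Sparse Transformers paper, then coarsens them to a 0/1 block grid. The function names, the single merged mask (the paper also describes splitting the two components across heads), and the `to_block_layout` helper are assumptions for illustration, not the repository's code.

```python
def strided_mask(n, stride):
    """Causal mask: attend to the last `stride` positions and every stride-th one before."""
    return [
        [1 if j <= i and (i - j < stride or (i - j) % stride == 0) else 0
         for j in range(n)]
        for i in range(n)
    ]


def fixed_mask(n, stride, c=1):
    """Causal mask: attend within the current stride-sized chunk, plus the last
    `c` positions of every chunk (the "summary" columns)."""
    return [
        [1 if j <= i and (j // stride == i // stride or j % stride >= stride - c) else 0
         for j in range(n)]
        for i in range(n)
    ]


def to_block_layout(mask, block_size):
    """Coarsen an element mask: a block is enabled if any element inside it is."""
    nb = len(mask) // block_size
    return [
        [1 if any(mask[i][j]
                  for i in range(bi * block_size, (bi + 1) * block_size)
                  for j in range(bj * block_size, (bj + 1) * block_size)) else 0
         for bj in range(nb)]
        for bi in range(nb)
    ]


strided = strided_mask(8, stride=4)
fixed = fixed_mask(8, stride=4, c=1)
fixed_blocks = to_block_layout(fixed, block_size=2)
```

A coarsened grid like `fixed_blocks` is the kind of 0/1 block pattern the kernels consume. Any masking finer than a block, such as the causal triangle inside diagonal blocks, would still have to be applied within the computed blocks.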
Caveats
- Archived and explicitly unmaintained — “code is provided as-is, no updates expected”
- Depends on TensorFlow and CUDA 10-era tooling; the blocksparse dependency is itself a separate repo to wrangle
- The “state-of-the-art” follow-up work from August 2020 lives in a different repository entirely
Verdict
Worth studying if you’re implementing sparse attention patterns from scratch and want to see how OpenAI structured the kernel interface — the block-sparsity abstraction is clean. Skip it if you need something that runs on modern PyTorch or CUDA 12 without archaeology; this is a research artifact, not a library.