openai/sparse_attention

OpenAI's sparse attention kernels: still frozen in 2019

Fused CUDA kernels for attention patterns that skip most of the QK^T matrix, letting transformers stretch to longer sequences without melting your GPU.

Velocity (7d): +0.6 ★/day · Trend: steady

What it does

This is OpenAI’s reference implementation of the sparse attention primitives from their 2019 paper “Generating Long Sequences with Sparse Transformers”. It provides fused CUDA kernels that compute attention under a block-sparsity pattern on the QK^T matrix: you declare which blocks of the attention matrix to calculate, and the rest are skipped entirely. The repo includes standard dense attention (with the upper triangle already elided), the paper’s “strided” and “fixed” sparse patterns, and a small recompute decorator that trades extra computation for lower memory use.
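To make the "which blocks get calculated" idea concrete, here is a minimal pure-Python sketch of a block layout for dense causal attention. The helper names `causal_block_layout` and `computed_fraction` are hypothetical and are not part of the repository's API; the sketch only counts blocks and does not touch a GPU.

```python
def causal_block_layout(seq_len, block_size):
    """Return a 0/1 grid where layout[qb][kb] == 1 means block (qb, kb) is computed."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    n_blocks = seq_len // block_size
    # Keep the diagonal and everything below it; the upper triangle is skipped.
    return [[1 if kb <= qb else 0 for kb in range(n_blocks)]
            for qb in range(n_blocks)]


def computed_fraction(layout):
    """Fraction of blocks in the grid that would actually be computed."""
    total = sum(len(row) for row in layout)
    active = sum(sum(row) for row in layout)
    return active / total


layout = causal_block_layout(seq_len=256, block_size=32)
for row in layout:
    print("".join("#" if block else "." for block in row))
print(computed_fraction(layout))
```

With 256 tokens and 32-token blocks the grid is 8×8, and 36 of its 64 blocks are active. The sparse patterns go further by also zeroing most of the lower triangle.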

The interesting bit

The sparsity isn’t just a mask applied after the fact; it’s wired into the kernel at the block level. You specify a 0/1 pattern on a grid of blocks, and blocks marked 0 are never computed and contribute nothing to the softmax. There’s also a callback mechanism for finer-grained masking within computed blocks. It’s attention as a stencil operation rather than a brute-force matrix multiply.
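The semantics described here can be imitated in a few lines of plain Python. This is an illustrative sketch only, not the repository's kernel: the function `block_sparse_attention` and its `keep` callback are hypothetical names, and the loops are written for readability rather than speed. Inactive blocks are skipped before any dot product is taken, and the softmax normalizes only over the positions that survive.

```python
import math


def block_sparse_attention(q, k, v, layout, block_size, keep=None):
    """Attention over lists of vectors, visiting only blocks marked 1 in layout.

    keep(i, j) is an optional callback that masks individual positions
    inside blocks that are computed.
    """
    n_dims = len(v[0])
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        scores, values = [], []
        for kb, active in enumerate(layout[i // block_size]):
            if not active:
                continue  # skipped block: no dot products, no softmax terms
            for j in range(kb * block_size, (kb + 1) * block_size):
                if keep is not None and not keep(i, j):
                    continue  # fine-grained mask inside a computed block
                scores.append(scale * sum(a * b for a, b in zip(q_i, k[j])))
                values.append(v[j])
        if not scores:
            out.append([0.0] * n_dims)  # row with nothing to attend to
            continue
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * val[d] for w, val in zip(weights, values)) / total
                    for d in range(n_dims)])
    return out


q = [[0.0], [0.0], [0.0], [0.0]]  # zero queries give uniform weights
k = [[1.0], [1.0], [1.0], [1.0]]
v = [[1.0], [2.0], [3.0], [4.0]]
layout = [[1, 0],
          [1, 1]]
causal = block_sparse_attention(q, k, v, layout, 2, keep=lambda i, j: j <= i)
print(causal)
```

Dropping the `keep` argument lets position 0 see position 1, because both sit in the same diagonal block. That is the gap the within-block callback exists to close.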

Key highlights

  • Fused kernels for QK^T with configurable block sparsity; block sizes of 8, 16, 32, 64 supported
  • “Strided” and “fixed” attention patterns from the Sparse Transformers paper implemented natively
  • Includes both a blocksparse path (fp16 and the smaller block sizes require a GPU with Tensor Cores) and a fallback TensorFlow path
  • Simple recompute=True decorator for gradient checkpointing-style memory savings
  • Requires OpenAI’s separate blocksparse package, which needs CUDA 10 + tensorflow-gpu or manual source build
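
As a rough illustration of the “strided” and “fixed” patterns mentioned above, the sketch below writes out the per-position rules as they are defined in the Sparse Transformers paper (the union of the two heads in each scheme) and reduces them to a block layout. It follows the paper's definitions, not this repository's code, and the helper names `make_strided`, `make_fixed`, and `to_block_layout` are hypothetical.

```python
def make_strided(stride):
    """Attend to the previous `stride` positions plus every stride-th earlier one."""
    def allowed(i, j):
        if j > i:
            return False  # causal: never look ahead
        return (i - j) < stride or (i - j) % stride == 0
    return allowed


def make_fixed(stride, c=1):
    """Attend within the current stride-sized window plus the last c columns of every window."""
    def allowed(i, j):
        if j > i:
            return False  # causal: never look ahead
        return (j // stride == i // stride) or (j % stride >= stride - c)
    return allowed


def to_block_layout(allowed, seq_len, block_size):
    """Mark a block 1 if any (query, key) pair inside it is allowed."""
    n_blocks = seq_len // block_size
    return [[int(any(allowed(i, j)
                     for i in range(qb * block_size, (qb + 1) * block_size)
                     for j in range(kb * block_size, (kb + 1) * block_size)))
             for kb in range(n_blocks)]
            for qb in range(n_blocks)]


strided_layout = to_block_layout(make_strided(4), seq_len=16, block_size=2)
fixed_layout = to_block_layout(make_fixed(4), seq_len=16, block_size=2)
for name, grid in (("strided", strided_layout), ("fixed", fixed_layout)):
    print(name)
    for row in grid:
        print("".join("#" if block else "." for block in row))
```

A block that is only partly allowed still has to be computed in full, which is why a finer mask inside computed blocks is needed on top of the block layout.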

Caveats

  • Archived and explicitly unmaintained — “code is provided as-is, no updates expected”
  • Depends on TensorFlow and CUDA 10 era tooling; the blocksparse dependency is itself a separate repo to wrangle
  • The “state-of-the-art” follow-up work from August 2020 lives in a different repository entirely

Verdict

Worth studying if you’re implementing sparse attention patterns from scratch and want to see how OpenAI structured the kernel interface — the block-sparsity abstraction is clean. Skip it if you need something that runs on modern PyTorch or CUDA 12 without archaeology; this is a research artifact, not a library.
