haoliuhl/ringattention
A distributed attention mechanism for training large language models with context lengths up to tens of millions of tokens.

This repository provides a JAX implementation of Ring Attention with Blockwise Parallel Transformers, which targets near-infinite context training for large language models. The approach distributes the attention and feedforward computation across multiple devices and overlaps inter-device communication with computation, so that communication adds no extra overhead to training. Both attention and the feedforward network are computed block by block, which lets the method scale efficiently to very long sequences.
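
The idea described above can be illustrated with a minimal single-process sketch in JAX. It is not the repository's API: the function names `simulated_ring_attention` and `reference_attention` are hypothetical, the "devices" are simulated as entries in a Python list, and the sketch assumes non-causal, single-head attention with a sequence length divisible by the number of devices. A real multi-device implementation would pass key/value blocks between devices with collective communication and overlap that transfer with the block computation, and would also apply the feedforward network blockwise; both are omitted here.

```python
import jax
import jax.numpy as jnp


def reference_attention(q, k, v):
    """Plain softmax attention over the full sequence, used for comparison."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v


def simulated_ring_attention(q, k, v, num_devices):
    """Hypothetical sketch: each list entry stands in for one device's shard."""
    scale = 1.0 / jnp.sqrt(q.shape[-1])
    # Shard the sequence dimension; each "device" keeps one query block.
    q_blocks = jnp.split(q, num_devices)
    k_blocks = jnp.split(k, num_devices)
    v_blocks = jnp.split(v, num_devices)

    # Per-device running statistics for a numerically stable online softmax.
    out = [jnp.zeros_like(qb) for qb in q_blocks]
    row_sum = [jnp.zeros((qb.shape[0], 1)) for qb in q_blocks]
    row_max = [jnp.full((qb.shape[0], 1), -jnp.inf) for qb in q_blocks]

    for _ in range(num_devices):
        for d in range(num_devices):
            # Attend from the local query block to the K/V block held right now.
            scores = (q_blocks[d] @ k_blocks[d].T) * scale
            new_max = jnp.maximum(row_max[d], scores.max(axis=-1, keepdims=True))
            correction = jnp.exp(row_max[d] - new_max)  # rescale old partial sums
            probs = jnp.exp(scores - new_max)
            out[d] = out[d] * correction + probs @ v_blocks[d]
            row_sum[d] = row_sum[d] * correction + probs.sum(axis=-1, keepdims=True)
            row_max[d] = new_max
        # Rotate K/V blocks one step around the ring (simulated by list rotation).
        k_blocks = k_blocks[-1:] + k_blocks[:-1]
        v_blocks = v_blocks[-1:] + v_blocks[:-1]

    # Normalize each block's accumulator and stitch the sequence back together.
    return jnp.concatenate([o / s for o, s in zip(out, row_sum)])


if __name__ == "__main__":
    kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
    q = jax.random.normal(kq, (16, 8))
    k = jax.random.normal(kk, (16, 8))
    v = jax.random.normal(kv, (16, 8))
    ring_out = simulated_ring_attention(q, k, v, num_devices=4)
    print(jnp.allclose(ring_out, reference_attention(q, k, v), atol=1e-4))
```

After `num_devices` rotation steps every query block has seen every key/value block, so the result should match full attention up to floating-point error, while no simulated device ever holds more than one key/value block at a time.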