lucidrains/linear-attention-transformer
A Transformer variant combining local and global attention mechanisms that scales linearly with sequence length for efficient language modeling.

This repository implements a Transformer architecture with a hybrid attention mechanism: local (QK^T)V attention is combined with global Q(K^TV) attention, so that time and memory grow linearly with sequence length. It also provides reversible networks, feedforward chunking, and embedding factorization to reduce memory usage. The library is designed for long-sequence language modeling tasks where standard quadratic attention becomes prohibitive.
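
The difference between the two multiplication orders can be illustrated with a minimal sketch in plain Python. This is not the library's code: the helpers `matmul` and `transpose` and the toy matrices are assumptions made for illustration, and the sketch leaves out the softmax or other normalization that a real attention layer applies to Q and K. It only shows that, without normalization, (QK^T)V and Q(K^TV) give the same result, while the intermediate matrix is n x n in the first order and d x d in the second.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Toy sizes (illustrative): n is the sequence length, d the head dimension.
n, d = 6, 2
q = [[i + j for j in range(d)] for i in range(n)]
k = [[i - j for j in range(d)] for i in range(n)]
v = [[2 * i + j for j in range(d)] for i in range(n)]

# Quadratic order: (Q K^T) V materializes an n x n matrix.
scores = matmul(q, transpose(k))
out_quadratic = matmul(scores, v)

# Linear order: Q (K^T V) materializes only a d x d matrix.
context = matmul(transpose(k), v)
out_linear = matmul(q, context)

# Integer inputs keep the comparison exact.
assert out_quadratic == out_linear
```

Because `context` has a size that does not depend on n, the cost of the second order grows linearly with sequence length, whereas `scores` grows quadratically. In practice a tensor library would replace the list-based helpers.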