lucidrains/native-sparse-attention-pytorch
PyTorch implementation of DeepSeek's Native Sparse Attention mechanism for efficient transformer inference.

Velocity (7d): +1.7 ★ / day
Trend: steady
This repository implements the sparse attention pattern from DeepSeek's ‘Native Sparse Attention’ paper, which is designed to accelerate transformer-based language models. It provides a custom PyTorch attention module with configurable sliding-window, compression-block, and selection-block settings. The implementation uses Triton and PyTorch's FlexAttention for efficient computation, and it includes an example training script for language modeling on Enwik8.
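
To make the three configurable pieces concrete, here is a minimal plain-Python sketch of what a sliding window, a compression block, and a selection step each do. It is an illustration of the general idea only, not the repository's API: the function names `sliding_window_mask`, `compress_blocks`, and `select_top_blocks` are hypothetical, keys are reduced to single numbers, and mean pooling stands in for whatever learned compression the actual module uses.

```python
# Illustrative sketch only. All names are hypothetical and are NOT the
# repository's API. Keys are scalars here to keep the example dependency-free.

def sliding_window_mask(seq_len, window):
    """Causal local mask: query i may attend key j when i - window < j <= i."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def compress_blocks(keys, block_size):
    """Summarize each full block of keys by its mean.

    Assumption: mean pooling is a stand-in for a learned compressor.
    Trailing keys that do not fill a whole block are ignored.
    """
    num_blocks = len(keys) // block_size
    return [
        sum(keys[b * block_size:(b + 1) * block_size]) / block_size
        for b in range(num_blocks)
    ]


def select_top_blocks(block_scores, num_selected):
    """Return the indices of the highest-scoring blocks, in ascending order."""
    ranked = sorted(
        range(len(block_scores)),
        key=lambda b: block_scores[b],
        reverse=True,
    )
    return sorted(ranked[:num_selected])


# Six positions, each query sees itself and one previous position.
mask = sliding_window_mask(seq_len=6, window=2)

# Six scalar keys grouped into three blocks of two.
summaries = compress_blocks([1.0, 3.0, 2.0, 6.0, 5.0, 7.0], block_size=2)

# Keep the two blocks with the largest summaries.
chosen = select_top_blocks(summaries, num_selected=2)
```

In this toy setting the block summaries double as selection scores. In a real attention layer the scores would come from query-key similarity, and the selected blocks would then be attended to at full resolution alongside the local window.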