datamllab/LongLM
Self-Extend extends LLM context windows without fine-tuning by applying two levels of attention, chosen by the relative distance between tokens.

This repository implements Self-Extend, a technique that lets large language models handle inputs longer than their pretraining context window without any fine-tuning. It splits attention into two levels based on relative token distance: nearby tokens keep their exact relative positions, while distant tokens are assigned to groups that share a position. Every relative position the model sees therefore stays within the range it encountered during pretraining. The implementation includes optimized versions using FlashAttention and Triton, with support for various LLMs including Llama, Qwen, and Gemma.
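The position remapping can be illustrated with a minimal sketch. This is not the repository's code: the function name `self_extend_rel_pos` and the parameters `group_size` and `neighbor_window` are hypothetical names chosen for illustration, and the exact handling of the window boundary is an assumption.

```python
def self_extend_rel_pos(q_pos: int, k_pos: int,
                        group_size: int, neighbor_window: int) -> int:
    """Relative position a query at q_pos uses for a key at k_pos.

    Hypothetical helper for illustration only, not the repository's API.
    """
    if k_pos > q_pos:
        raise ValueError("causal attention: key must not come after the query")

    distance = q_pos - k_pos
    # Nearby tokens: keep the exact relative position.
    # (Assumption: the window boundary is exclusive.)
    if distance < neighbor_window:
        return distance

    # Distant tokens: floor-divide both positions by the group size so that
    # tokens in the same group share one position, then shift the result so
    # it continues from where the neighbor window ends.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size - k_pos // group_size + shift


if __name__ == "__main__":
    # With groups of 8 and a 1024-token neighbor window, the largest relative
    # position in a 16384-token sequence stays well under 4096.
    print(self_extend_rel_pos(16383, 0, group_size=8, neighbor_window=1024))
```

Under these assumed settings, a sequence four times longer than a 4096-token pretraining window never produces a relative position above 2943, which is why no fine-tuning is needed for the positional encoding to stay in range.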