mit-han-lab/streaming-llm
A framework for deploying LLMs on infinite-length inputs through efficient KV cache management using attention sinks.

StreamingLLM lets LLMs process arbitrarily long sequences without fine-tuning by keeping only a fixed number of initial attention sink tokens plus the most recent KV cache entries. Because the cache stays at a constant size, memory use no longer grows with input length and generation remains stable past the pretraining sequence length, which makes the approach suitable for multi-round dialogue and other streaming applications. It does not enlarge the context window: the model attends only to the retained tokens. It has been integrated into major inference frameworks, including HuggingFace Transformers and TensorRT-LLM.
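
The eviction policy described above can be illustrated with a toy cache. This is a minimal sketch, not the repository's implementation: the class name `SinkKVCache`, its methods, and the sizes used are hypothetical, and plain integers stand in for the per-token key/value tensors.

```python
from collections import deque


class SinkKVCache:
    """Toy cache: keep the first `num_sinks` entries forever, plus a
    sliding window of the `window` most recent entries.
    (Hypothetical illustration; names and sizes are assumptions.)"""

    def __init__(self, num_sinks: int, window: int):
        self.num_sinks = num_sinks
        self.sinks = []                      # attention sink entries, never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops out automatically

    def append(self, entry) -> None:
        # The first few tokens fill the sink slots; later tokens go to the window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self) -> list:
        # What attention would see: the sinks followed by the recent window.
        return self.sinks + list(self.recent)


cache = SinkKVCache(num_sinks=2, window=3)
for token_id in range(10):   # stream ten tokens through the cache
    cache.append(token_id)

kept = cache.entries()
print(kept)
```

However many tokens are streamed in, the cache never holds more than `num_sinks + window` entries, which is where the constant memory footprint comes from.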