
mit-han-lab/streaming-llm

A framework for deploying LLMs on infinite-length inputs through efficient KV cache management using attention sinks.

Velocity (7d): +7.4 ★ / day · Trend: steady

StreamingLLM lets LLMs process arbitrarily long sequences without fine-tuning by keeping only two parts of the KV cache: a small, fixed number of initial tokens that act as attention sinks, and a sliding window of the most recent tokens. Because the cache size stays constant, memory use stays bounded during inference and generation remains stable past the pre-training context length, which makes the approach suitable for multi-round dialogue and other streaming applications. It has been integrated into major inference frameworks, including HuggingFace Transformers and TensorRT-LLM.
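The cache policy described above can be illustrated with a minimal sketch. It is not the repository's code: the names `evict_with_sinks`, `StreamingCache`, `num_sinks`, and `window` are hypothetical, the cache is modeled as a plain Python list with one entry per token instead of per-layer key/value tensors, and the sink tokens are assumed to be the first tokens of the sequence.

```python
from typing import List, TypeVar

T = TypeVar("T")


def evict_with_sinks(cache: List[T], num_sinks: int, window: int) -> List[T]:
    """Keep the first `num_sinks` entries plus the last `window` entries.

    Hypothetical helper: each list element stands in for one token's
    key/value pair.
    """
    if num_sinks < 0 or window < 0:
        raise ValueError("num_sinks and window must be non-negative")
    # Nothing to evict while the cache still fits in the budget.
    if len(cache) <= num_sinks + window:
        return list(cache)
    # Slice by explicit start index so window == 0 yields an empty list.
    recent = cache[len(cache) - window:]
    return cache[:num_sinks] + recent


class StreamingCache:
    """Toy cache that applies the eviction rule after every appended token."""

    def __init__(self, num_sinks: int = 4, window: int = 8) -> None:
        self.num_sinks = num_sinks
        self.window = window
        self.entries: List[int] = []

    def append(self, entry: int) -> None:
        self.entries.append(entry)
        self.entries = evict_with_sinks(self.entries, self.num_sinks, self.window)


if __name__ == "__main__":
    cache = StreamingCache(num_sinks=4, window=8)
    for token_id in range(20):
        cache.append(token_id)
    # The cache never grows beyond num_sinks + window entries.
    print(len(cache.entries), cache.entries)
```

However many tokens are appended, the list holds at most `num_sinks + window` entries, which is why memory stays constant for a stream of any length. The sketch leaves out everything model-specific, such as per-layer tensors and the handling of positional encodings for the retained entries.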
