hpcaitech/SwiftInfer
TensorRT-based implementation of StreamingLLM for production-grade LLM inference serving.

Velocity (7d): +0.5 ★/day · Trend: → steady
SwiftInfer is an optimized implementation of StreamingLLM built on NVIDIA TensorRT and TensorRT-LLM (v0.6.0). It supports LLM inference over inputs of effectively unbounded length through the Attention Sink mechanism, which keeps the first few tokens in the KV cache alongside a sliding window of the most recent tokens. The project aims to make streaming LLM inference production-grade by using TensorRT's GPU-optimized engines to serve large language models faster.
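The Attention Sink idea mentioned above can be illustrated with a toy cache eviction policy. This is only a minimal sketch in plain Python, not SwiftInfer's or TensorRT-LLM's actual code: the class name `SinkKVCache` and the parameters `n_sink` and `window` are hypothetical, and integers stand in for the per-token key/value tensors a real cache would hold.

```python
from collections import deque


class SinkKVCache:
    """Toy cache (hypothetical, for illustration only).

    Keeps the first `n_sink` entries forever (the "attention sinks")
    plus the `window` most recent entries; everything in between is evicted.
    """

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops automatically

    def append(self, entry):
        # Fill the sink slots first, then roll the recent window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        # Entries the model would attend over at the current step.
        return self.sinks + list(self.recent)


cache = SinkKVCache(n_sink=2, window=3)
for token_id in range(10):
    cache.append(token_id)

print(cache.entries())  # [0, 1, 7, 8, 9]
```

However many tokens are appended, the cache never holds more than `n_sink + window` entries, which is why memory use stays bounded as the input stream grows.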