hpcaitech/SwiftInfer

TensorRT-based implementation of StreamingLLM for production-grade LLM inference serving.

Velocity (7d): +0.5 ★/day · Trend: steady

SwiftInfer is an optimized implementation of StreamingLLM built on NVIDIA TensorRT and TensorRT-LLM v0.6.0. It uses the Attention Sink mechanism to support inference over effectively unbounded input lengths, and it aims to make StreamingLLM production-grade by applying TensorRT's hardware-accelerated optimizations for faster serving of large language models.
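
To illustrate the Attention Sink idea mentioned above, here is a minimal sketch of the cache eviction policy StreamingLLM is known for: keep the first few tokens (the "sinks") permanently, plus a sliding window of the most recent tokens, so the cache size stays bounded however long the stream runs. This is not SwiftInfer's code and does not use TensorRT. The class name `SinkCache`, its parameters, and the default sizes are hypothetical and chosen only for illustration, and the entries stand in for per-token key/value tensors.

```python
from collections import deque


class SinkCache:
    """Toy model of an attention-sink KV cache (illustrative, not SwiftInfer's API)."""

    def __init__(self, num_sink=4, window=8):
        # num_sink and window are assumed example sizes, not values from the source.
        self.num_sink = num_sink
        self.sinks = []                     # earliest tokens, never evicted
        self.recent = deque(maxlen=window)  # sliding window; oldest entry drops automatically

    def append(self, entry):
        # Fill the sink slots first; afterwards every new entry goes to the window.
        if len(self.sinks) < self.num_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        # Entries visible to attention: the sinks followed by the recent window.
        return self.sinks + list(self.recent)

    def __len__(self):
        return len(self.sinks) + len(self.recent)


cache = SinkCache(num_sink=2, window=3)
for token_position in range(10):
    cache.append(token_position)

print(cache.entries())
```

With two sink slots and a window of three, the cache never holds more than five entries, which is why memory use stays constant as the input grows. A real implementation would also have to handle positional encodings relative to positions within the cache, which this sketch leaves out.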