NVIDIA/TensorRT-LLM
NVIDIA's inference optimization framework for running LLMs efficiently on NVIDIA GPUs using specialized kernels and runtime orchestration.

Velocity (7d): +13 ★/day · Trend: steady
TensorRT LLM provides a Python API for defining large language models (LLMs) and applies state-of-the-art optimizations to run inference efficiently on NVIDIA GPUs. It includes specialized kernels for common operations, along with Python and C++ runtime components that orchestrate inference execution. The framework supports Mixture-of-Experts (MoE) architectures and integrates with PyTorch.