NVIDIA/TensorRT-LLM

NVIDIA's inference optimization framework for running LLMs efficiently on NVIDIA GPUs using specialized kernels and runtime orchestration.

Velocity (7d): +13 ★ / day · Trend: steady

TensorRT LLM provides a Python API for defining Large Language Models and performs inference efficiently on NVIDIA GPUs through state-of-the-art optimizations. It includes specialized kernels for common operations and Python/C++ runtime components that orchestrate performant inference execution. The framework supports MoE architectures and integrates with PyTorch.
