mit-han-lab/smoothquant
SmoothQuant enables INT8 weight and activation quantization for large language models to reduce memory footprint and accelerate inference.

SmoothQuant is a post-training quantization method for LLMs, including models beyond 100 billion parameters. It migrates quantization difficulty from activations, whose outliers are hard to quantize, to weights, which are easier to quantize. Smoothing the activation outliers maintains accuracy while enabling efficient W8A8 (8-bit weight, 8-bit activation) quantization. SmoothQuant is integrated into major inference frameworks and platforms, including NVIDIA TensorRT-LLM, ONNX Runtime, Intel Neural-Compressor, and Amazon SageMaker.
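
The idea of migrating quantization difficulty from activations to weights can be illustrated with a small sketch. It assumes the per-channel smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), where alpha is the migration strength (0.5 here). The function names `smoothing_scales` and `smooth` are hypothetical and written in plain Python for illustration; they are not the API of this repository.

```python
# Illustrative sketch only: plain Python lists, not the smoothquant library API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def smoothing_scales(x, w, alpha=0.5):
    """Per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = [max(abs(v) for v in col) for col in zip(*x)]  # one value per column of X
    w_max = [max(abs(v) for v in row) for row in w]          # one value per row of W
    return [a ** alpha / b ** (1 - alpha) for a, b in zip(act_max, w_max)]

def smooth(x, w, alpha=0.5):
    """Return (X / s, s * W); the product of the two is mathematically unchanged."""
    s = smoothing_scales(x, w, alpha)
    x_s = [[v / sj for v, sj in zip(row, s)] for row in x]   # divide each activation channel
    w_s = [[v * sj for v in row] for row, sj in zip(w, s)]   # scale the matching weight row
    return x_s, w_s

# Toy data (assumed): X is tokens x in_channels, W is in_channels x out_channels.
# Activation channel 1 carries an outlier.
x = [[0.5, 40.0, -0.2],
     [-0.3, -60.0, 0.1]]
w = [[0.2, -0.4],
     [0.1, 0.3],
     [-0.5, 0.25]]

x_s, w_s = smooth(x, w)
y_ref = matmul(x, w)         # original layer output
y_smooth = matmul(x_s, w_s)  # output after smoothing, equal up to rounding

print("max |X| before:", max(abs(v) for row in x for v in row))
print("max |X| after: ", max(abs(v) for row in x_s for v in row))
```

After smoothing, the activation range shrinks and the weight range grows by the same per-channel factor, so both tensors become easier to represent in INT8 while the layer output stays the same. In a real model the activation maxima would come from calibration data, and the scales would typically be folded into the preceding layer rather than applied at run time.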