NVIDIA/Model-Optimizer
NVIDIA library providing quantization, pruning, NAS, distillation, and speculative decoding to compress and optimize deep learning models for faster inference.

Model Optimizer is a unified library of state-of-the-art model optimization techniques: quantization, pruning, neural architecture search (NAS), distillation, speculative decoding, and sparsity. It compresses deep learning models and exports optimized checkpoints that are ready to deploy in downstream inference frameworks such as SGLang, TensorRT-LLM, and TensorRT. The library works with Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA’s Megatron-LM ecosystem.
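
To give a feel for what quantization, the first technique listed above, does to a model's weights, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration of the general idea only: it does not use or mirror the Model Optimizer API, and the function names `quantize_int8` and `dequantize_int8` are hypothetical helpers made up for this example.

```python
# Illustrative sketch only: symmetric per-tensor INT8 quantization.
# Not the Model Optimizer API; helper names are hypothetical.

def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.02, -1.27, 0.64, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The round-trip error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as an 8-bit integer plus one shared scale, which is where the memory and bandwidth savings come from. Production quantizers typically add calibration data, per-channel or per-block scales, and lower-precision formats on top of this basic scheme.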