intel/neural-compressor
Open-source Python library for compressing LLMs and deep learning models via quantization, pruning, and sparsity across PyTorch, TensorFlow, and ONNX Runtime.

Intel Neural Compressor provides state-of-the-art model compression techniques, including low-bit quantization (INT8, FP8, INT4, MXFP8, NVFP4), weight-only quantization, SmoothQuant, pruning, and sparsity. It supports popular models such as LLaMA, Qwen, DeepSeek, and Flux, and integrates with AutoRound for automated quantization tuning. The library targets Intel hardware, including Gaudi AI accelerators, Core Ultra processors, and Xeon Scalable processors, to improve inference performance and reduce memory footprint.
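
To make the idea of low-bit quantization concrete, here is a minimal, library-independent sketch of symmetric per-tensor INT8 weight quantization in plain Python. It does not use or reproduce the Intel Neural Compressor API; the function names `quantize_int8` and `dequantize_int8` are hypothetical and chosen only for illustration. Real implementations typically work per channel or per group and add calibration.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Illustrative helper only; not part of any library API.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # For an all-zero (or empty) tensor any scale works; use 1.0 to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 symmetric range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    for w, qi, r in zip(weights, q, restored):
        print(f"{w:+.4f} -> {qi:+4d} -> {r:+.4f}")
```

Each weight is stored as one signed 8-bit integer plus a single shared floating-point scale, which is where the memory saving comes from. The round-trip error per weight is at most half a quantization step (`scale / 2`).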