huggingface/optimum-quanto
A PyTorch quantization backend providing int2/int4/int8/float8 weight quantization and int8/float8 activation quantization, with CUDA acceleration for optimized model inference.

Velocity (7d): +1.0 ★/day · Trend: steady
Optimum Quanto is a quantization library for PyTorch models. It supports both dynamic and static quantization and automatically inserts the stubs needed for quantized operations and modules. It accelerates matrix multiplications on CUDA devices, and its serialization is compatible with PyTorch weight_only and safetensors formats. This makes it useful for speeding up LLM and transformer inference while reducing memory footprint.
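
To make the memory-footprint claim concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization, the general idea behind storing weights in fewer bits. It is an illustration only and not Optimum Quanto's implementation or API: the helper names `quantize_int8` and `dequantize` are hypothetical, and the library itself works on PyTorch tensors with its own kernels.

```python
# Illustrative sketch only: NOT the optimum-quanto API.
# Symmetric per-tensor int8 quantization on a plain Python list.

def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Assumption: an all-zero tensor gets scale 1.0 to avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Each restored value is within half a quantization step of the original.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
print(codes, scale, max_error)
```

Each code fits in one byte instead of the four bytes of a float32 value, which is where the memory saving comes from; the price is a rounding error of at most half a quantization step per value.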