AutoGPTQ/AutoGPTQ
A Python package for quantizing LLMs to 4-bit or 8-bit weights using the GPTQ algorithm for faster inference.

Velocity (7d): +4.4 ★/day · Trend: steady
AutoGPTQ provides user-friendly APIs for quantizing large language models with GPTQ, a weight-only quantization method. It supports Marlin-optimized int4 kernels for faster matrix multiplication and integrates with the Hugging Face Transformers, Optimum, and PEFT libraries. Quantized models run noticeably faster (e.g., ~35% faster for Llama-7b) and use less memory, while largely preserving model quality.
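
To make "weight-only quantization" and the memory saving concrete, here is a minimal pure-Python sketch of 4-bit group quantization using plain round-to-nearest. It is not the GPTQ algorithm itself (GPTQ additionally compensates rounding error using second-order information) and it does not use AutoGPTQ's API. The function names, the group size of 128, and the 16-bit scale storage are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Asymmetric round-to-nearest quantization of one group of weights.

    Illustrative only: returns integer codes plus the scale and zero point
    needed to reconstruct approximate float weights.
    """
    qmax = (1 << bits) - 1
    # Extend the range to include 0.0 so the zero point is always representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 for an all-zero group
    zero = max(0, min(qmax, round(-w_min / scale)))
    codes = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: int) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [(q - zero) * scale for q in codes]


def bits_per_weight(bits: int = 4, group_size: int = 128, scale_bits: int = 16) -> float:
    """Average storage per weight, assuming one scale and one zero point per group."""
    return bits + (scale_bits + bits) / group_size


# Small demonstration on a hypothetical group of weights.
group = [-0.8, -0.1, 0.0, 0.35, 0.7]
codes, scale, zero = quantize_group(group)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print("codes:", codes)
print("max reconstruction error:", max_error)
print("bits per weight:", bits_per_weight())
```

Under these assumptions each weight costs a little over 4 bits instead of 16, which is where the reduced memory footprint comes from; the per-weight reconstruction error stays within half a quantization step.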