
AutoGPTQ/AutoGPTQ

A Python package for quantizing LLMs to 4-bit or 8-bit weights using the GPTQ algorithm for faster inference.

Velocity (7d): +4.4 ★/day. Trend: steady.

AutoGPTQ provides user-friendly APIs for quantizing large language models with GPTQ, a weight-only quantization method. It supports Marlin int4 kernels optimized for faster matrix multiplication, and it integrates with the Hugging Face Transformers, optimum, and peft libraries. Quantized models run inference faster (e.g., ~35% faster for Llama-7b) and use less memory while maintaining model quality.
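To make "weight-only quantization" and "reduced memory footprint" concrete, here is a minimal pure-Python sketch. It is not AutoGPTQ's API and not the GPTQ algorithm: it uses plain round-to-nearest quantization per group of weights as a simpler stand-in, whereas GPTQ picks the quantized values more carefully to limit output error. All function names, the group size, and the sample weights are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# One quantized group: integer codes plus the scale and minimum needed to decode them.
Group = Tuple[List[int], float, float]


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = (1 << bits) - 1              # 15 steps above zero for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(group: Group) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    codes, scale, w_min = group
    return [c * scale + w_min for c in codes]


def quantize(weights: List[float], group_size: int = 4, bits: int = 4) -> List[Group]:
    """Quantize a flat weight list group by group (hypothetical layout)."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def weight_bytes(num_weights: int, bits: int) -> int:
    """Storage for the weights alone, ignoring scales and zero points."""
    return num_weights * bits // 8


if __name__ == "__main__":
    sample = [0.12, -0.5, 0.33, 0.9, -0.07, 0.41, -0.88, 0.05]  # made-up weights
    for group in quantize(sample):
        print(group[0], [round(w, 3) for w in dequantize_group(group)])
    # Weights-only storage for a hypothetical 7-billion-parameter model.
    print(weight_bytes(7_000_000_000, 16), "bytes at 16-bit")
    print(weight_bytes(7_000_000_000, 4), "bytes at 4-bit")
```

Each reconstructed weight differs from the original by at most half a quantization step for its group, which is why smaller groups (at the cost of storing more scales) tend to preserve accuracy better. The byte counts cover only the weights themselves, so real quantized checkpoints are somewhat larger.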
