qwopqwop200/GPTQ-for-LLaMa
A Python library for 4-bit quantization of LLaMA models using the GPTQ algorithm to reduce memory footprint and enable efficient inference.

GPTQ-for-LLaMa implements post-training quantization for the LLaMA language model family, compressing model weights to 4-bit precision. It uses GPTQ, described as a state-of-the-art one-shot weight quantization method, to cut memory use substantially (from ~14GB to ~4.7GB for LLaMA-7B) with minimal perplexity degradation on benchmarks such as Wikitext2. A configurable group size gives finer-grained control: weights are quantized in groups, each with its own quantization parameters, which trades a small amount of extra storage for lower quantization error. The goal is efficient inference of large language models on resource-constrained hardware.
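
To make the role of the group size concrete, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest with a per-group scale and minimum, not GPTQ's error-compensating weight updates, and it is not taken from this repository: the function names (quantize_group, quantize_row, max_abs_error) and the sample weights are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, minimum)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group of weights.

    Illustrative only: GPTQ itself also adjusts the remaining weights to
    compensate for the error of each quantized weight.
    """
    levels = (1 << bits) - 1                   # 15 distinct steps for 4 bits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0    # fall back to 1.0 for constant groups
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize_row(row: List[float], group_size: int, bits: int = 4) -> List[Group]:
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Reassemble a row from its quantized groups."""
    out: List[float] = []
    for codes, scale, w_min in groups:
        out.extend(dequantize_group(codes, scale, w_min))
    return out


def max_abs_error(row: List[float], group_size: int) -> float:
    """Largest reconstruction error after a quantize/dequantize round trip."""
    restored = dequantize_row(quantize_row(row, group_size))
    return max(abs(a - b) for a, b in zip(row, restored))


# Hypothetical weights: four small values followed by four large ones.
row = [0.10, -0.20, 0.05, 0.15, 4.0, -3.5, 2.5, -1.0]

print("one group of 8 :", max_abs_error(row, group_size=8))
print("two groups of 4:", max_abs_error(row, group_size=4))
```

With a single group, the large weights stretch the quantization range and the small weights lose most of their resolution. Smaller groups let each region use its own scale, at the cost of storing one scale and one minimum per group.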