
qwopqwop200/GPTQ-for-LLaMa

A Python library for 4-bit quantization of LLaMA models using the GPTQ algorithm to reduce memory footprint and enable efficient inference.

Star velocity (7d): +2.6 ★/day · Trend: steady

GPTQ-for-LLaMa implements post-training quantization for the LLaMA family of language models, compressing model weights to 4-bit precision. It uses the GPTQ algorithm, described as a state-of-the-art one-shot weight quantization method, to cut memory use substantially (from ~14 GB to ~4.7 GB for LLaMA-7B) with minimal perplexity degradation on benchmarks such as Wikitext2. It supports configurable group sizes for finer-grained quantization control and is designed to enable efficient inference of large language models on resource-constrained hardware.
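To make the role of the group size concrete, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest quantization, not the GPTQ algorithm itself, and it is not taken from this repository: the function names (`quantize_group`, `quantize_row`, and so on) and the sample weights are hypothetical, chosen only for illustration. The idea it shows is that each group of weights gets its own scale and offset, so a smaller group size limits how far a single large weight can stretch the quantization grid.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (4-bit codes, scale, minimum)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Round-to-nearest quantization of one group (illustrative, not GPTQ)."""
    levels = (1 << bits) - 1  # 15 steps for 4 bits, codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(group: Group) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    codes, scale, lo = group
    return [c * scale + lo for c in codes]


def quantize_row(row: List[float], group_size: int, bits: int = 4) -> List[Group]:
    """Split a weight row into consecutive groups and quantize each separately."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Reassemble the approximate row from its quantized groups."""
    restored: List[float] = []
    for group in groups:
        restored.extend(dequantize_group(group))
    return restored


def max_abs_error(row: List[float], group_size: int) -> float:
    """Largest absolute reconstruction error for a given group size."""
    restored = dequantize_row(quantize_row(row, group_size))
    return max(abs(a - b) for a, b in zip(row, restored))


# Hypothetical weights: eight small values followed by eight larger ones.
example_row = [0.01 * i for i in range(8)] + [0.5 + 0.1 * i for i in range(8)]

for size in (16, 8):
    print(f"group_size={size}: max abs error = {max_abs_error(example_row, size):.5f}")
```

With one group spanning the whole row, the small weights share a coarse grid with the large ones. Splitting the row into two groups gives the small weights a much finer grid, at the cost of storing one extra scale and offset per group. GPTQ adds error compensation on top of this kind of grid, which this sketch does not attempt.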
