
IST-DASLab/gptq

A post-training quantization method for compressing large language models to low bit-widths (2-4 bits) with custom CUDA kernels.

Velocity (7d): +1.7 ★/day
Trend: steady

This repository implements the GPTQ algorithm from the ICLR 2023 paper, which quantizes generative pretrained transformers after training. It includes code for compressing the OPT and BLOOM model families, custom CUDA kernels for accelerated 3-bit matrix-vector products, and evaluation tools for measuring perplexity and zero-shot performance on quantized models. It also supports LLaMA models, with techniques such as activation ordering and true-sequential quantization.
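
To give a sense of what storing weights at 3 bits means, here is a minimal sketch of plain round-to-nearest quantization onto a uniform min-max grid. This is only the simple baseline, not the GPTQ procedure itself, which additionally adjusts the remaining weights to compensate for each rounding error. The function name `quantize_row` and the sample values are hypothetical and do not come from the repository.

```python
def quantize_row(weights, bits=3):
    """Round each weight to the nearest point on a uniform grid with 2**bits levels.

    Illustrative baseline only (round-to-nearest); not the GPTQ update rule.
    """
    max_code = 2 ** bits - 1              # highest integer code, e.g. 7 for 3 bits
    lo, hi = min(weights), max(weights)
    # Guard against a constant row, where the range would be zero.
    scale = (hi - lo) / max_code if hi > lo else 1.0
    # Map each weight to an integer code and clamp it to the valid range.
    codes = [min(max_code, max(0, round((w - lo) / scale))) for w in weights]
    # Reconstruct the approximate weights from the integer codes.
    dequantized = [lo + c * scale for c in codes]
    return codes, scale, lo, dequantized


row = [-0.7, -0.1, 0.05, 0.35, 0.7]       # hypothetical weight row
codes, scale, zero, approx = quantize_row(row, bits=3)
max_error = max(abs(w - a) for w, a in zip(row, approx))
print(codes, round(max_error, 4))
```

Each weight is stored as an integer code between 0 and 7 plus a shared scale and offset per row, so the reconstruction error of any single weight is at most half a grid step.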
