Is GPTQ-for-LLaMa open source?

Yes — qwopqwop200/GPTQ-for-LLaMa is open source, released under the Apache-2.0 license.

What language is GPTQ-for-LLaMa written in?

qwopqwop200/GPTQ-for-LLaMa is primarily written in Python.

How popular is GPTQ-for-LLaMa?

qwopqwop200/GPTQ-for-LLaMa has 3.1k stars on GitHub.

Where can I find GPTQ-for-LLaMa?

qwopqwop200/GPTQ-for-LLaMa is on GitHub at https://github.com/qwopqwop200/GPTQ-for-LLaMa.

qwopqwop200/GPTQ-for-LLaMa

Squeeze LLaMA into 4 bits, but the author moved on

Implements one-shot GPTQ quantization to compress LLaMA weights to 3 or 4 bits, cutting GPU memory by roughly two-thirds. The author now recommends AutoGPTQ instead.

★3.1k stars Python Inference · Serving Language Models

What it does

This toolkit applies the GPTQ post-training quantization algorithm to Meta’s LLaMA models, compressing weights to 3 or 4 bits. That shrinks checkpoints by roughly two-thirds to three-quarters and cuts GPU memory enough to fit LLaMA-33B on a single RTX 3090. The repo provides scripts for quantization, benchmarking, and inference, and can save compressed weights as PyTorch .pt checkpoints or safetensors files.
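As a rough check on those size figures, the sketch below estimates the average number of stored bits per weight and the resulting reduction relative to FP16. It is an idealized back-of-the-envelope model, not the repo's code: the function names are hypothetical, and the assumption that each group stores one 16-bit scale plus one zero point is a simplification.

```python
FP16_BITS = 16

def effective_bits(wbits, group_size=None, scale_bits=16):
    """Average stored bits per quantized weight.

    Assumption: each group of `group_size` weights shares one scale
    (`scale_bits` wide) and one zero point (`wbits` wide). Without
    grouping, that metadata is treated as negligible.
    """
    if group_size is None:
        return float(wbits)
    return wbits + (scale_bits + wbits) / group_size

def reduction_vs_fp16(wbits, group_size=None):
    """Fraction of FP16 storage saved for the quantized weights only."""
    return 1.0 - effective_bits(wbits, group_size) / FP16_BITS

for wbits in (4, 3):
    for group_size in (None, 128):
        saved = reduction_vs_fp16(wbits, group_size)
        print(f"{wbits}-bit, group_size={group_size}: {saved:.1%} smaller")
```

Under these assumptions, 4-bit weights without grouping are exactly 75% smaller than FP16, and a group size of 128 adds only a small overhead. Real checkpoints usually shrink less than this estimate because some tensors, such as embeddings, are typically left unquantized.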

The interesting bit

The maintainer is refreshingly candid: they now focus on AutoGPTQ and explicitly tell visitors to use that instead. What remains is a transparent, minimal reference implementation of GPTQ on LLaMA, complete with Triton kernels and head-to-head perplexity tables against bitsandbytes.

Key highlights

Cuts LLaMA-7B GPU memory from ~13.9 GB (FP16) to under 5 GB at 4-bit; LLaMA-33B, which runs out of memory in FP16, fits in ~19.5 GB.
Perplexity on Wikitext2 and C4 stays close to FP16 baselines, especially with a group-size of 128.
Includes direct memory and perplexity comparisons with bitsandbytes NF4 and FP4 formats.
Supports CPU-memory offload for models too large for VRAM, though the README notes this is “very slow.”
Based directly on the IST-DASLab GPTQ reference and fpgaminer’s Triton kernels.
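To make the group-size highlight concrete, here is a minimal sketch of group-wise quantization in pure Python. It uses plain min-max round-to-nearest rather than GPTQ's error-compensating weight updates, so it only illustrates why smaller groups tend to track the original weights more closely. All function names are hypothetical, and the synthetic row with a single outlier is an assumption chosen for illustration.

```python
import random

def quantize_group(values, bits):
    """Asymmetric min-max round-to-nearest quantization of one group."""
    levels = (1 << bits) - 1                 # 15 for 4-bit codes 0..15
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:                         # constant group: any scale works
        scale = 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def quantize_row(row, bits=4, group_size=128):
    """Quantize each consecutive chunk of `group_size` weights separately."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

def dequantize_row(groups):
    """Map integer codes back to floats using each group's scale and offset."""
    out = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out

def mean_abs_error(row, bits, group_size):
    restored = dequantize_row(quantize_row(row, bits, group_size))
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)

rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(1024)]
row[5] = 0.5                                 # one outlier stretches the range

err_whole_row = mean_abs_error(row, 4, len(row))   # one scale for the row
err_group_128 = mean_abs_error(row, 4, 128)        # one scale per 128 weights
print(f"whole row: {err_whole_row:.5f}  group 128: {err_group_128:.5f}")
```

With one scale for the whole row, the outlier widens the quantization step for every weight. With groups of 128, only the group containing the outlier pays that cost, which is the intuition behind the better perplexity reported for grouped quantization.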

Caveats

The author explicitly recommends AutoGPTQ over this repo for new projects.
Triton kernels mean Linux-only support; Windows users are directed to WSL2.
Even at 4 bits, LLaMA-65B runs out of memory on a single RTX 3090 without layer offloading, and quantization itself requires substantial CPU RAM.
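The 65B caveat follows from simple arithmetic, sketched below. The check counts weights only and uses approximate parameter counts from the LLaMA paper; the helper names are hypothetical, and 24 GiB is the RTX 3090's nominal VRAM. Activations, the KV cache, and quantization metadata need additional headroom, so this is a lower bound rather than a prediction of measured memory use.

```python
GIB = 1024 ** 3

# Approximate parameter counts from the LLaMA paper.
LLAMA_PARAMS = {"7B": 6.7e9, "13B": 13.0e9, "33B": 32.5e9, "65B": 65.2e9}

def weight_gib(n_params, bits):
    """Size of the weights alone, in GiB, at `bits` bits per weight."""
    return n_params * bits / 8 / GIB

def weights_fit(n_params, bits, vram_gib=24.0):
    """True if the weights alone fit in VRAM; real usage needs more room."""
    return weight_gib(n_params, bits) < vram_gib

for name, n_params in LLAMA_PARAMS.items():
    size = weight_gib(n_params, 4)
    print(f"LLaMA-{name}: {size:.1f} GiB at 4-bit, fits: {weights_fit(n_params, 4)}")
```

Under these assumptions the 4-bit weights of the 65B model alone exceed 24 GiB, while the 33B model leaves room to spare, which matches the behavior described above.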

Verdict

Worth a look if you want to understand the raw mechanics of LLaMA quantization, but practitioners should probably head straight to AutoGPTQ. Skip it if you need Windows-native support or turnkey deployment.
