IST-DASLab/gptq

Compressing LLMs to 2–4 bits without retraining

This repository implements the GPTQ algorithm, which compresses the weights of generative transformers to 2–4 bits after training. It works post-training because retraining models the size of OPT-175B is prohibitively expensive.

★ 2.3k stars · Python · Inference · Serving · Language Models

What it does

This is the official research codebase behind the ICLR 2023 GPTQ paper. It quantizes the weights of generative transformers, specifically the OPT and BLOOM families, to 2, 3, or 4 bits after training. The repository bundles evaluation scripts for language-generation perplexity and zero-shot tasks, plus a custom CUDA kernel for 3-bit matrix–vector products.
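To make "2, 3, or 4 bits" concrete, here is a minimal sketch of uniform round-to-nearest quantization with optional grouping. It is not the GPTQ algorithm: GPTQ additionally adjusts the not-yet-quantized weights to compensate for each rounding error, which this sketch omits. The function names (`quantize_group`, `quantize_row`, `max_abs_error`) and the min/max grid are illustrative assumptions, not taken from the repository.

```python
def quantize_group(weights, bits):
    """Round one group of floats to a uniform grid with 2**bits levels.

    Hypothetical helper: plain round-to-nearest on a min/max grid.
    """
    levels = (1 << bits) - 1                  # e.g. 3 for 2 bits: codes 0..3
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]   # integer codes
    dequant = [lo + c * scale for c in codes]            # values the model would see
    return codes, dequant


def quantize_row(row, bits, group_size=None):
    """Quantize a weight row; each group gets its own grid (scale and offset)."""
    group_size = group_size or len(row)       # no grouping: one grid per row
    codes, dequant = [], []
    for start in range(0, len(row), group_size):
        c, d = quantize_group(row[start:start + group_size], bits)
        codes.extend(c)
        dequant.extend(d)
    return codes, dequant


def max_abs_error(row, dequant):
    """Largest absolute difference between original and dequantized weights."""
    return max(abs(a - b) for a, b in zip(row, dequant))


# A row that mixes large and tiny weights.
row = [-1.0, -0.5, 0.0, 0.5, 0.01, 0.02, 0.03, 0.04]

codes_full, deq_full = quantize_row(row, bits=2)
codes_grp, deq_grp = quantize_row(row, bits=2, group_size=4)

print("one grid per row :", max_abs_error(row, deq_full))
print("groups of 4      :", max_abs_error(row, deq_grp))
```

With a single grid for the whole row, the four small weights all collapse onto the same level. Giving each group of four its own grid lets the small weights be represented much more closely, at the cost of storing one extra scale and offset per group.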

The interesting bit

The authors discovered that quantizing columns in order of decreasing activation size (--act-order) and enforcing sequential quantization inside individual transformer blocks (--true-sequential) fixes GPTQ’s “strangely bad performance” on LLaMa-7B and dramatically tames outliers in OPT-66B. It is a rare case where an official paper repo openly admits its original method flubbed a small model and then shows exactly how two simple heuristics fixed it.
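The column-ordering idea can be sketched in a few lines. The example below assumes that "activation size" for an input column is measured as the sum of squared calibration activations feeding that column (the diagonal of XᵀX); that choice of proxy, and every function name here, is an illustrative assumption rather than the repository's code. The quantizer itself is left out: the sketch only shows how columns would be reordered before quantization and restored afterwards.

```python
def activation_order(calibration_inputs):
    """Column indices sorted by decreasing summed squared activation.

    calibration_inputs: list of input vectors, one per calibration sample.
    (Assumed proxy for "activation size"; hypothetical helper.)
    """
    n_cols = len(calibration_inputs[0])
    energy = [sum(x[j] ** 2 for x in calibration_inputs) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: energy[j], reverse=True)


def permute_columns(weight, perm):
    """Reorder the columns of a row-major weight matrix."""
    return [[row[j] for j in perm] for row in weight]


def invert_permutation(perm):
    """Permutation that undoes `perm` when passed to permute_columns."""
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv


# Two calibration samples, three input features.
calibration_inputs = [[0.1, 3.0, -1.0],
                      [0.2, -2.0, 1.5]]
weight = [[10, 20, 30],
          [40, 50, 60]]

perm = activation_order(calibration_inputs)      # column 1 is largest, then 2, then 0
reordered = permute_columns(weight, perm)        # quantize columns in this order
restored = permute_columns(reordered, invert_permutation(perm))

print(perm)
print(restored == weight)
```

The intuition behind the ordering, under this sketch's assumptions, is that columns multiplied by large activations contribute most to the layer's output error, so they are handled first while the most remaining columns are still available to absorb the rounding error.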

Key highlights

Post-training quantization of OPT and BLOOM weights to 2, 3, or 4 bits, with optional weight grouping.
A custom 3-bit CUDA kernel for quantized matrix–vector products, yielding up to a 3.25× generation speedup for OPT-175B on an A100.
--act-order and --true-sequential heuristics that improved LLaMa-7B Wiki2 perplexity from 7.15 to 6.09 and cut OPT-66B 3-bit perplexity from 14.16 to 9.95 (lower is better).
Scripts for perplexity evaluation, zero-shot task performance, and kernel benchmarking.
Minimal LLaMa integration; the authors redirect users to GPTQ-for-LLaMA for more complete features.
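For readers unfamiliar with the perplexity (PPL) figures quoted above, the sketch below shows the standard definition: the exponential of the average negative log-likelihood per token. It is a generic illustration, not the repository's evaluation script, and the helper names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


def relative_reduction(before, after):
    """Fractional drop in perplexity between two runs."""
    return (before - after) / before


# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 8)
print(uniform_ppl)

# The LLaMa-7B Wiki2 figures from the list above.
print(relative_reduction(7.15, 6.09))
```

Because perplexity is an exponential of the average loss, lower is better, and the drop from 7.15 to 6.09 corresponds to a reduction of roughly 15%.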

Caveats

The 3-bit CUDA kernels are explicitly optimized only for OPT-175B on 1×A100 or 2×A6000; the README warns of suboptimal performance on smaller models or other GPUs.
LLaMa support is minimal and requires installing transformers from source alongside sentencepiece.
The codebase is split into family-specific scripts (opt.py, bloom.py, llama.py) rather than offering a generic, model-agnostic API.

Verdict

Worth a look if you are researching post-training quantization of large language models and have the target hardware. Skip it if you need a polished, model-agnostic inference library; this is a research artifact with scripts tailored to specific model families.

Frequently asked

What is IST-DASLab/gptq? It implements the GPTQ algorithm to compress generative transformer weights to 2–4 bits after training, because retraining models the size of OPT-175B is prohibitively expensive.
Is gptq open source? Yes. IST-DASLab/gptq is open source, released under the Apache-2.0 license.
What language is gptq written in? IST-DASLab/gptq is primarily written in Python.
How popular is gptq? IST-DASLab/gptq has 2.3k stars on GitHub.
Where can I find gptq? IST-DASLab/gptq is on GitHub at https://github.com/IST-DASLab/gptq.