Is AutoAWQ open source?

Yes — casper-hansen/AutoAWQ is open source, released under the MIT license.

What language is AutoAWQ written in?

casper-hansen/AutoAWQ is primarily written in Python.

How popular is AutoAWQ?

casper-hansen/AutoAWQ has 2.3k stars on GitHub.

Where can I find AutoAWQ?

casper-hansen/AutoAWQ is on GitHub at https://github.com/casper-hansen/AutoAWQ.

casper-hansen/AutoAWQ

The solo project that squeezed LLMs into 4 bits calls it a day

A one-person effort to make 4-bit weight quantization practical for thousands of Hugging Face models.

★2.3k stars Python Inference · Serving

What it does

AutoAWQ implements Activation-aware Weight Quantization (AWQ), compressing LLM weights to 4-bit integers while keeping inference compatible with Hugging Face transformers. The project claims a 3× speedup and 3× memory reduction over FP16 for memory-bound workloads, and supports NVIDIA GPUs, AMD GPUs through ROCm, and Intel CPUs and GPUs. It can quantize a 7B model in roughly 10–15 minutes and a 70B model in about an hour, producing ready-to-run checkpoints.
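To make "4-bit integers" concrete, here is a minimal sketch of group-wise round-to-nearest quantization with a zero point. It is plain round-to-nearest only, not the activation-aware scale search that gives AWQ its name, and it is not the project's own code. The function names, the group size of 128, and the 16-bit scale are assumptions chosen for illustration.

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Assumes the group straddles zero, as weight groups typically do.
    """
    q_max = (1 << n_bits) - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = max((w_max - w_min) / q_max, 1e-5)     # guard against a constant group
    zero_point = max(0, min(q_max, round(-w_min / scale)))
    codes = [max(0, min(q_max, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def bits_per_weight(n_bits=4, group_size=128, scale_bits=16):
    """Storage cost per weight: the code plus one scale and one zero point per group."""
    return n_bits + (scale_bits + n_bits) / group_size


group = [-0.8, -0.3, 0.0, 0.27, 0.7]
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
compression = 16 / bits_per_weight()               # versus 16-bit FP16 weights
```

Under these assumptions the weights alone shrink by a little under 4×. A claimed overall reduction of about 3× is plausible once activations and the KV cache, which stay in higher precision, are counted.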

The interesting bit

The README is unusually honest about the trade-offs: at small batch sizes inference is memory-bound, so the smaller 4-bit weights move through memory faster, but at higher batch sizes the INT4-to-FP16 dequantization overhead can actually make AWQ slower than FP16. An optional fused-module path combines multiple layers into a single operation for an extra speed boost, but it locks your sequence length and batch size at load time and only works on Linux.
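The batch-size trade-off can be illustrated with a toy roofline-style model: a decoding step takes as long as the slower of moving the weights through memory or doing the arithmetic. Every number below (bandwidth, peak throughput, the 30% dequantization overhead) is a hypothetical placeholder, not a measurement of AutoAWQ or of any real GPU.

```python
def step_time(batch, weight_bytes, flops_per_token,
              mem_bandwidth, peak_flops, dequant_overhead=0.0):
    """Estimated seconds per decoding step under a simple roofline model."""
    memory_time = weight_bytes / mem_bandwidth
    compute_time = batch * flops_per_token * (1.0 + dequant_overhead) / peak_flops
    return max(memory_time, compute_time)


PARAMS = 7e9                    # a 7B-parameter model
FP16_BYTES = PARAMS * 2         # 2 bytes per weight
INT4_BYTES = PARAMS * 0.5       # 4 bits per weight, ignoring scales and zero points
FLOPS_PER_TOKEN = 2 * PARAMS    # one multiply-add per weight per token
MEM_BANDWIDTH = 1.0e12          # hypothetical: 1 TB/s
PEAK_FLOPS = 1.0e14             # hypothetical: 100 TFLOP/s
DEQUANT_OVERHEAD = 0.3          # hypothetical: 30% extra compute for INT4 -> FP16


def speedup_over_fp16(batch):
    """Ratio of FP16 step time to INT4 step time; above 1 means INT4 wins."""
    fp16 = step_time(batch, FP16_BYTES, FLOPS_PER_TOKEN, MEM_BANDWIDTH, PEAK_FLOPS)
    int4 = step_time(batch, INT4_BYTES, FLOPS_PER_TOKEN, MEM_BANDWIDTH, PEAK_FLOPS,
                     DEQUANT_OVERHEAD)
    return fp16 / int4


for batch in (1, 8, 64, 256):
    print(batch, round(speedup_over_fp16(batch), 2))
```

In this model the quantized path wins while both paths are memory-bound and loses once both become compute-bound, because the dequantization cost is then the only difference left. Where the crossover falls on real hardware depends on the kernel and the device.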

Key highlights

Officially deprecated: the solo maintainer has handed the project off to the vLLM team (via llm-compressor) and to MLX-LM for Mac support.
Last known good configuration is PyTorch 2.6.0 and Transformers 4.51.3; future compatibility is explicitly not guaranteed.
Two kernel flavors: GEMV, which is faster at batch size 1 but poorly suited to large contexts, and GEMM, which handles large contexts and outperforms FP16 at batch sizes below 8.
Over 7,000 AWQ models are already on Hugging Face, making it a de facto standard for 4-bit model sharing.
Fused modules use FasterTransformer kernels, though they return dummy past_key_values and the sequence length cannot be changed once the model has been created.
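The choice between the two kernel flavors can be sketched as a small helper that builds a quantization config. Both function names are hypothetical. The dictionary keys follow the example config in the project's README and should be checked against the installed version before use.

```python
def choose_kernel_version(batch_size, long_context=False):
    """Hypothetical helper: pick a kernel flavor from the expected workload."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # GEMV targets single-sequence decoding with short contexts.
    if batch_size == 1 and not long_context:
        return "GEMV"
    # GEMM covers larger contexts and small batches; at batch sizes of 8 or
    # more, the advantage over FP16 may disappear.
    return "GEMM"


def build_quant_config(batch_size, long_context=False):
    """Assemble a 4-bit config; the group size of 128 is an assumed default."""
    return {
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": choose_kernel_version(batch_size, long_context),
    }


chat_config = build_quant_config(batch_size=1)
rag_config = build_quant_config(batch_size=1, long_context=True)
```

A dictionary like this would be passed as the quant_config argument when quantizing a model. The kernel flavor is chosen at quantization time, so it is worth deciding on the target workload before producing a checkpoint.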

Caveats

Fused modules are Linux-only and require you to fix max_seq_len and batch_size when loading the model.
The project is no longer maintained; the maintainer explicitly tells users to report future breakage to the Transformers project itself.
At high batch sizes, the dequantization overhead can make AWQ slower than FP16, so it is not a universal throughput win.
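The fused-module caveats can be turned into a small guard that only requests fusion on Linux and forces the caller to commit to a sequence length and batch size up front. The helper name is hypothetical, and the keyword names (fuse_layers, max_seq_len, batch_size) are assumed to match the loader in the installed AutoAWQ release, so verify them before relying on this.

```python
import platform


def fused_load_kwargs(max_seq_len, batch_size, system_name=None):
    """Hypothetical helper: build loader keyword arguments for fused modules.

    Falls back to the unfused path on anything other than Linux.
    """
    system_name = system_name or platform.system()
    if system_name != "Linux":
        return {"fuse_layers": False}
    if max_seq_len <= 0 or batch_size <= 0:
        raise ValueError("max_seq_len and batch_size must be positive")
    # Both values are fixed for the lifetime of the loaded model.
    return {
        "fuse_layers": True,
        "max_seq_len": max_seq_len,
        "batch_size": batch_size,
    }


linux_kwargs = fused_load_kwargs(2048, 1, system_name="Linux")
windows_kwargs = fused_load_kwargs(2048, 1, system_name="Windows")
```

The resulting dictionary would be unpacked into the model-loading call. Because the values cannot be changed afterwards, serving a longer prompt or a larger batch means reloading the model.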

Verdict

Use this if you need to run inference on a 7B–70B model on a single consumer GPU and want a mature quantization backend with broad Hugging Face support. Look to vLLM’s native FP16 or the new llm-compressor pipeline instead if you are building a high-throughput serving stack.
