Vahe1994/AQLM

Additive quantization that actually survives one-bit weights

AQLM exists to compress LLMs down to 1–2 bits per weight while keeping them coherent enough to run in a browser or on modest hardware.

★1.3k stars Python Inference · Serving Language Models

What it does

AQLM is the official PyTorch implementation of Additive Quantization for Large Language Models, a compression scheme that represents weights as sums of codebook vectors rather than full-precision floats. The repository also implements PV-Tuning, a post-quantization fine-tuning algorithm that recovers accuracy after extreme compression. It ships pre-quantized checkpoints for LLaMA, Mistral, and Mixtral families and supports inference on GPU, CPU, or inside a browser via a companion Rust/WASM demo.
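
To make "sums of codebook vectors" concrete, here is a minimal sketch of the decode step only. It assumes each small group of weights is stored as one index per codebook, and it leaves out how the codebooks and indices are learned. The helper names (`decode_group`, `make_codebooks`) and the toy sizes are hypothetical and do not come from the repository.

```python
import random


def make_codebooks(num_codebooks, codebook_size, group_size, seed=0):
    """Build random toy codebooks; real codebooks would be learned from data."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(group_size)]
         for _ in range(codebook_size)]
        for _ in range(num_codebooks)
    ]


def decode_group(codes, codebooks):
    """Reconstruct one group of weights as the sum of one vector per codebook."""
    group_size = len(codebooks[0][0])
    out = [0.0] * group_size
    for codebook, code in zip(codebooks, codes):
        vector = codebook[code]          # look up the selected codebook entry
        for i, value in enumerate(vector):
            out[i] += value              # additive: contributions are summed
    return out


# Toy setup (assumed sizes): 2 codebooks of 4 vectors each, groups of 8 weights.
codebooks = make_codebooks(num_codebooks=2, codebook_size=4, group_size=8)
codes = [3, 1]                           # one index per codebook
weights = decode_group(codes, codebooks)
```

Only the indices in `codes` are stored per group, plus the shared codebooks, which is where the compression comes from. Plain Python lists keep the sketch dependency-free; an actual implementation would operate on tensors.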

The interesting bit

The full title of the PV-Tuning paper, "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression", is a quiet admission that the usual tricks stop working this far down. AQLM keeps going anyway: a Llama-2-7b variant at roughly 1 bit achieves a WikiText-2 perplexity of 7.85, and a 1-bit Llama-3-70B fits into 13 GB.
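
As a rough sanity check on those sizes, the arithmetic can be sketched as follows. The formula counts only the stored code indices, and reading a scheme as "number of codebooks, bits per index, weights per group" is an assumption about the naming, not something this page spells out. Both function names are hypothetical.

```python
def bits_per_weight(num_codebooks, codebook_bits, group_size):
    """Index bits spent per weight, ignoring codebooks and any scales."""
    return num_codebooks * codebook_bits / group_size


def code_storage_gb(num_weights, bits):
    """Size of the code indices alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_weights * bits / 8 / 1e9


# One 16-bit index per group of 8 weights (assumed reading of a 2-bit scheme).
two_bit = bits_per_weight(num_codebooks=1, codebook_bits=16, group_size=8)

# One 16-bit index per group of 16 weights (assumed reading of a 1-bit scheme).
one_bit = bits_per_weight(num_codebooks=1, codebook_bits=16, group_size=16)

# 70 billion weights at 1 bit each: index storage only.
codes_gb = code_storage_gb(70e9, one_bit)
```

Under these assumptions the indices for 70 billion weights take about 8.75 GB. The gap up to the 13 GB quoted above would be whatever stays at higher precision, such as embeddings, codebooks, and scales; that attribution is a guess, since the page does not break the figure down.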

Key highlights

Pre-quantized models on HuggingFace span 1-bit and 2-bit variants of Llama 2/3, Mistral, Mixtral, Command-R, Phi-3, and Qwen2, with reported WikiText-2 perplexity and MMLU drops
Inference options include GPU kernels, CPU streaming, CUDA graphs for a ~3× speedup, vLLM serving, and PEFT fine-tuning
Base AQLM was accepted to ICML 2024; the PV-Tuning extension received a NeurIPS 2024 oral
A browser-native demo built in Rust/WASM lets you run AQLM+PV inference on CPU without installing anything
v1.1.7 adds support for arbitrary 8-dimensional codebooks on GPU and improves accuracy for 1-bit configurations
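
For the GPU inference path, a minimal loading sketch through Hugging Face `transformers` might look like the following. It assumes the `aqlm` inference package, `transformers`, and `accelerate` are installed, and that the checkpoint ID you pass is one of the pre-quantized models published on the Hub. The function names are hypothetical wrappers, not part of the repository.

```python
def load_aqlm_model(model_id):
    """Load a pre-quantized checkpoint and its tokenizer from the Hub."""
    # Imported lazily so that importing this module stays cheap.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # keep the dtype stored in the checkpoint
        device_map="auto",    # place layers automatically; needs accelerate
    )
    return model, tokenizer


def generate(model, tokenizer, prompt, max_new_tokens=32):
    """Run a short generation and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (downloads weights, so it is left commented out). The ID below is an
# example and is assumed to still be published under that name:
# model, tokenizer = load_aqlm_model("ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf")
# print(generate(model, tokenizer, "Additive quantization is"))
```

Nothing here is specific to AQLM beyond the checkpoint itself; the quantization settings are expected to be read from the checkpoint's config, so the call looks like loading any other causal language model.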

Caveats

The quantization code is explicitly limited to the LLaMA, Mistral, and Mixtral families, even though pre-quantized checkpoints are listed for other models; quantizing a different architecture yourself is not currently handled
Perplexity is evaluated at different context lengths (4k for Llama 2, 8k for Mistral/Mixtral and Llama 3), and the authors warn against comparing perplexities between models that use different vocabularies
Checkpoints using g16 codebook schemes require inference library v1.1.6 or newer, creating a version compatibility trap for older environments
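
One way to guard against the version trap above is to check the installed inference library before loading a g16 checkpoint. This is a sketch only: it assumes the package is installed under the distribution name `aqlm`, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_G16_VERSION = "1.1.6"  # minimum stated in the caveat above


def parse_version(text):
    """Turn '1.1.6' into (1, 1, 6), stopping at the first non-numeric part."""
    numbers = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # suffixes such as 'dev0' or '6rc1' are dropped (conservative)
        numbers.append(int(piece))
    return tuple(numbers)


def supports_g16(installed, minimum=MIN_G16_VERSION):
    """Return True if the installed version meets the g16 minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_aqlm_version():
    """Return the installed aqlm version string, or None if it is missing."""
    try:
        return version("aqlm")
    except PackageNotFoundError:
        return None


current = installed_aqlm_version()
if current is None:
    print("aqlm is not installed")
elif not supports_g16(current):
    print(f"aqlm {current} is older than {MIN_G16_VERSION}; upgrade before loading g16 checkpoints")
```

Comparing integer tuples instead of raw strings avoids the classic mistake of ranking "1.1.10" below "1.1.6". A pre-release such as "1.1.6rc1" is deliberately treated as too old.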

Verdict

A solid choice if you need to run 70B-class models on consumer hardware or want to experiment with sub-2-bit quantization without building the stack yourself. If you already have enough VRAM for standard 4-bit loading, the extra complexity may not pay off.

Frequently asked

What is Vahe1994/AQLM?: AQLM exists to compress LLMs down to 1–2 bits per weight while keeping them coherent enough to run in a browser or on modest hardware.
Is AQLM open source?: Yes — Vahe1994/AQLM is open source, released under the Apache-2.0 license.
What language is AQLM written in?: Vahe1994/AQLM is primarily written in Python.
How popular is AQLM?: Vahe1994/AQLM has 1.3k stars on GitHub.
Where can I find AQLM?: Vahe1994/AQLM is on GitHub at https://github.com/Vahe1994/AQLM.