Is auto-round open source?

Yes — intel/auto-round is open source, released under the Apache-2.0 license.

What language is auto-round written in?

intel/auto-round is primarily written in Python.

How popular is auto-round?

intel/auto-round has 1.5k stars on GitHub.

Where can I find auto-round?

intel/auto-round is on GitHub at https://github.com/intel/auto-round.


intel/auto-round

Squeezing LLMs down to 2 bits, mostly on purpose

AutoRound uses sign-gradient descent to compress LLMs and VLMs into 2–4 bit weights that stay coherent on modest hardware.

★ 1.5k stars | Python | Inference · Serving | LLMOps · Eval


What it does

AutoRound is a quantization toolkit that compresses Large Language Models and Vision-Language Models down to 2–4 bit weights. It uses sign-gradient descent to tune how weights are rounded, which limits accuracy loss while keeping quantization time short: Intel claims a 7B model takes roughly ten minutes on a single GPU. The toolkit exports to several formats, including AutoGPTQ, AutoAWQ, GGUF, and its own native format, and it plugs directly into inference engines like vLLM, SGLang, and Hugging Face Transformers.
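To make the sign-gradient idea concrete, here is a toy sketch in plain Python. It is not AutoRound's code: the function names (`fake_quantize`, `sign_round`), the single-neuron "layer", the step size, and the random calibration data are all assumptions chosen for illustration. Each weight gets a learnable rounding offset in [-0.5, 0.5], and the offset is nudged by the sign of the gradient of the output error, with rounding treated as identity (a straight-through estimator).

```python
import math
import random


def fake_quantize(weights, offsets, scale, qmax):
    """Round w/scale + offset to the nearest integer, clamp, and rescale."""
    out = []
    for w, v in zip(weights, offsets):
        q = math.floor(w / scale + v + 0.5)   # round to nearest
        q = max(-qmax - 1, min(qmax, q))      # stay on the signed integer grid
        out.append(q * scale)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def output_error(weights, q_weights, samples):
    """Squared error between full-precision and quantized outputs."""
    return sum((dot(q_weights, x) - dot(weights, x)) ** 2 for x in samples)


def sign_round(weights, samples, bits=4, steps=200, lr=0.005):
    """Tune rounding offsets with signed gradient steps (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    offsets = [0.0] * len(weights)            # all-zero offsets are plain RTN
    rtn_err = output_error(
        weights, fake_quantize(weights, offsets, scale, qmax), samples)
    best_err, best_offsets = rtn_err, list(offsets)

    for _ in range(steps):
        q_weights = fake_quantize(weights, offsets, scale, qmax)
        grads = [0.0] * len(weights)
        for x in samples:
            diff = dot(q_weights, x) - dot(weights, x)
            for i, xi in enumerate(x):
                # Straight-through estimator: treat rounding as identity,
                # so d(q_weight_i)/d(offset_i) is approximated by `scale`.
                grads[i] += 2.0 * diff * scale * xi
        for i, g in enumerate(grads):
            # Signed step: only the gradient's direction is used, not its size.
            direction = (g > 0) - (g < 0)
            offsets[i] = max(-0.5, min(0.5, offsets[i] - lr * direction))
        err = output_error(
            weights, fake_quantize(weights, offsets, scale, qmax), samples)
        if err < best_err:                    # keep the best offsets seen so far
            best_err, best_offsets = err, list(offsets)
    return rtn_err, best_err, best_offsets


random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(16)]
samples = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
rtn_err, tuned_err, offsets = sign_round(weights, samples)
print(f"RTN error {rtn_err:.4f}, tuned error {tuned_err:.4f}")
```

Because the loop keeps the best offsets it has seen, the tuned error can never be worse than the round-to-nearest starting point. The real toolkit works on whole transformer blocks with real calibration data, so treat this only as a picture of the mechanism.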

The interesting bit

The project treats quantization as an optimization recipe rather than a single algorithm. It offers five CLI recipes, ranging from auto-round-best (slowest, highest accuracy) to auto-round-rtn (pure round-to-nearest, no tuning), and can generate mixed-precision schemes automatically in minutes. That range lets you pick the accuracy/speed trade-off that fits, whether you are shipping to production or just experimenting.

Key highlights

Supports ultra-low-bit widths (W2A16, W3A16, W4A16) and exotic dtypes like MXFP4 and NVFP4.
Exports to AutoGPTQ, AutoAWQ, GGUF, and LLM-Compressor formats for broad compatibility.
Quantizes a 7B model in about ten minutes on one GPU, according to the README.
Includes out-of-the-box support for 10+ vision-language models.
Offers a model-free / calibration-free pure RTN mode when you lack GPU time or patience.
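For readers new to the notation, W4A16 means 4-bit weights with 16-bit activations, and pure RTN simply snaps each weight to the nearest point on a small integer grid. The sketch below shows a group-wise, asymmetric variant in plain Python. The helper names and the tiny group size of 4 are assumptions made for readability; this is not AutoRound's implementation or its storage format.

```python
import math


def rtn_quantize_group(group, bits=4):
    """Map one group of weights to unsigned integer codes plus (scale, minimum)."""
    levels = 2 ** bits - 1                     # 15 for 4-bit codes
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [
        min(levels, max(0, math.floor((w - lo) / scale + 0.5)))
        for w in group
    ]
    return codes, scale, lo


def rtn_dequantize_group(codes, scale, lo):
    """Recover approximate weights from integer codes."""
    return [c * scale + lo for c in codes]


def rtn_quantize(weights, bits=4, group_size=4):
    """Quantize a flat weight list group by group; no calibration data needed."""
    return [
        rtn_quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


weights = [0.12, -0.53, 0.91, 0.07, -0.30, 0.44, -0.88, 0.65]
packed = rtn_quantize(weights, bits=4, group_size=4)
restored = [w for group in packed for w in rtn_dequantize_group(*group)]
for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Each group stores its own scale and minimum, so an outlier in one group does not coarsen the grid for the others. The reconstruction error per weight is bounded by half the group's scale, which is why smaller groups or more bits give tighter approximations.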

Caveats

MXFP4 currently lacks real inference kernels, so you will need to export through LLM-Compressor to actually run it.
The README itself advises falling back to pure RTN mode if the tuned quantization encounters issues, suggesting the full algorithm can be finicky.
Faster recipes (auto-round-light, auto-round-opt-rtn) explicitly trade accuracy for speed, especially at W2.

Verdict

Worth a look if you need to serve large models on limited VRAM or CPU RAM and can tolerate a quantization step. Skip it if you are already happy with FP16 or have hardware that does not benefit from weight-only quantization.
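A rough back-of-the-envelope calculation shows why this matters on limited VRAM. The helper below is a hypothetical sketch: it counts weight storage only, and it assumes one 16-bit scale per group of 128 weights. That layout is an illustrative assumption, not a statement about any particular export format.

```python
def weight_gigabytes(n_params, weight_bits, group_size=None, scale_bits=16):
    """Estimate weight storage in GB (10^9 bytes), ignoring everything else."""
    bits_per_weight = weight_bits
    if group_size is not None:
        # Amortize one scale value over every weight in the group.
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


for label, bits, group in [("16-bit", 16, None), ("W4A16", 4, 128), ("W2A16", 2, 128)]:
    print(f"{label}: {weight_gigabytes(7e9, bits, group):.2f} GB")
```

Under these assumptions a 7B model needs about 14 GB for 16-bit weights, roughly 3.6 GB at 4 bits, and under 2 GB at 2 bits. Real footprints are higher, since the estimate leaves out the KV cache, activations, zero points, and any layers kept at higher precision.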
