mit-han-lab/llm-awq
AWQ provides low-bit weight quantization (INT3/4) for efficient LLM compression and acceleration with optimized CUDA kernels.

AWQ implements activation-aware weight quantization, which compresses large language models to 3- or 4-bit weights while largely preserving accuracy. The library includes pre-computed quantization parameters for popular LLMs (LLaMA, OPT, CodeLlama, Vicuna, VILA, LLaVA) and provides memory-efficient 4-bit linear layers in PyTorch, backed by custom CUDA kernels for fast inference in both the context (prefill) and decoding stages. TinyChat 2.0 extends this work to deliver state-of-the-art prefilling speeds for LLMs and VLMs on edge devices including RTX 4090 and Jetson Orin.
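To make the idea of activation-aware quantization concrete, the sketch below shows one way it could look in plain NumPy. It is an illustration under stated assumptions, not the library's API: the function names `quantize_dequantize` and `search_scales`, the group size of 128, the grid of 20 exponents, and the synthetic data are all hypothetical choices. The weights of a linear layer are scaled per input channel by a power of the average activation magnitude, quantized group-wise to 4 bits with round-to-nearest, and then the scaling is undone; the exponent is picked by a small grid search that minimizes output error on calibration activations.

```python
import numpy as np


def quantize_dequantize(w, n_bits=4, group_size=128):
    """Group-wise asymmetric round-to-nearest quantization (simulated).

    w has shape (out_features, in_features). Returns the dequantized
    weights and the integer codes, so the rounding error can be inspected.
    """
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    q_max = 2 ** n_bits - 1
    scale = np.maximum(w_max - w_min, 1e-5) / q_max  # step size per group
    zero = np.round(-w_min / scale)                   # zero point per group
    codes = np.clip(np.round(g / scale) + zero, 0, q_max)
    w_hat = ((codes - zero) * scale).reshape(out_f, in_f)
    return w_hat, codes


def search_scales(w, x_calib, n_grid=20, **qargs):
    """Grid-search the exponent alpha of the per-channel scale s = mean|x|^alpha.

    Hypothetical helper: returns (best_error, best_alpha, best_scale).
    alpha = 0 reproduces plain round-to-nearest, so the result is never
    worse than that baseline on the calibration data.
    """
    act_mag = np.maximum(np.abs(x_calib).mean(axis=0), 1e-4)  # (in_features,)
    y_ref = x_calib @ w.T
    best = (np.inf, None, None)
    for alpha in np.linspace(0.0, 1.0, n_grid, endpoint=False):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered around 1
        w_hat, _ = quantize_dequantize(w * s, **qargs)
        w_hat = w_hat / s  # equivalent to dividing the activations by s
        err = float(np.mean((x_calib @ w_hat.T - y_ref) ** 2))
        if err < best[0]:
            best = (err, float(alpha), s)
    return best


# Synthetic layer and calibration data (assumed, for illustration only).
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256))
x = rng.standard_normal((128, 256))
x[:, :4] *= 30.0  # a few input channels carry much larger activations

w_rtn, codes = quantize_dequantize(w)
rtn_err = float(np.mean((x @ w_rtn.T - x @ w.T) ** 2))
awq_err, alpha, s = search_scales(w, x)
print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"scaled error:           {awq_err:.4f} (alpha={alpha:.2f})")
```

The sketch only simulates quantization in floating point. A real 4-bit linear layer would also pack the integer codes into a compact storage format and run dedicated kernels, which is outside the scope of this example.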