OpenGVLab/OmniQuant
LLM quantization framework that compresses model weights (W4/W3/W2) and activations for efficient inference on GPUs and mobile devices.

OmniQuant provides quantization algorithms for large language models. It supports weight-only quantization (W4A16/W3A16/W2A16) and weight-activation quantization (W6A6/W4A4), where WxAy denotes x-bit weights and y-bit activations. The repository includes a model zoo with pre-quantized checkpoints for LLaMA, LLaMA-2-Chat, OPT, Falcon, and Mixtral-8x7B. It also integrates with MLC-LLM to deploy quantized models on GPUs and mobile hardware, where memory and compute are limited.
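To make the weight-only setting concrete, here is a minimal sketch of what storing weights at 4 bits (the "W4" in W4A16) can look like. It uses plain min-max round-to-nearest quantization, which is a generic baseline and not OmniQuant's calibrated algorithm. The function names `quantize_weights` and `dequantize_weights` are hypothetical and are not part of the repository's API.

```python
def quantize_weights(weights, n_bits=4):
    """Asymmetric min-max uniform quantization of a list of floats.

    Illustrative baseline only; not OmniQuant's method or API.
    Returns integer codes in [0, 2**n_bits - 1] plus the scale and
    zero point needed to map them back to floats.
    """
    q_max = 2 ** n_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 when all weights are equal, to avoid dividing by zero.
    scale = (w_max - w_min) / q_max or 1.0
    zero_point = round(-w_min / scale)
    # Round to the nearest step, shift by the zero point, clamp to the valid range.
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_weights(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


# Example: one small group of weights quantized to 4 bits.
weights = [-0.8, -0.1, 0.0, 0.3, 0.7]
codes, scale, zero_point = quantize_weights(weights, n_bits=4)
restored = dequantize_weights(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

In a W4A16 setup, only the integer codes, scales, and zero points are stored; activations stay in 16-bit floating point, and the weights are dequantized (or used through a fused kernel) at inference time. Lowering `n_bits` to 3 or 2 shrinks storage further at the cost of a coarser grid and larger reconstruction error.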