SqueezeAILab/SqueezeLLM
A post-training quantization framework for LLMs that uses dense-and-sparse weight decomposition to serve large models with a reduced memory footprint.

SqueezeLLM implements Dense-and-Sparse Quantization, a technique that splits each weight matrix into a dense component that can be compressed heavily and a sparse component that preserves sensitive outlier values. According to the project's reported results, this allows the quantized Vicuna models to be served within 6 GB of memory while reaching about 2% higher MMLU than an FP16 baseline model with a 2x larger memory footprint. The framework is designed for efficient LLM serving and has been integrated into the vLLM inference engine.
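The dense-and-sparse split described above can be illustrated with a small NumPy sketch. This is not SqueezeLLM's actual code or API: the function names (`dense_sparse_split`, `kmeans_quantize`), the magnitude-based outlier rule, the 0.5% outlier fraction, and the plain unweighted 1-D k-means codebook are all assumptions chosen for illustration. It only shows why pulling a few outliers into a sparse matrix leaves a dense remainder that quantizes with less error.

```python
import numpy as np

def dense_sparse_split(W, outlier_frac=0.005):
    """Split W so that W == dense + sparse (hypothetical helper).

    The largest-magnitude entries go to `sparse`; everything else stays in `dense`.
    """
    k = max(1, int(round(outlier_frac * W.size)))
    # Magnitude of the k-th largest entry is the outlier threshold.
    thresh = np.partition(np.abs(W).ravel(), W.size - k)[W.size - k]
    mask = np.abs(W) >= thresh
    sparse = np.where(mask, W, 0.0)
    dense = np.where(mask, 0.0, W)
    return dense, sparse, mask

def kmeans_quantize(x, bits=3, iters=20):
    """Unweighted 1-D Lloyd k-means (an assumption, for illustration only).

    Returns a codebook of 2**bits values and, for each entry of x, its codebook index.
    """
    flat = x.ravel()
    # Start the centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, 2**bits))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(centroids.size):
            members = flat[idx == c]
            if members.size:
                centroids[c] = members.mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx.reshape(x.shape)

# Synthetic weights: a Gaussian bulk plus 20 injected large-magnitude outliers.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
pos = rng.choice(W.size, size=20, replace=False)
W.flat[pos] = rng.choice([-1.0, 1.0], size=20) * rng.uniform(5.0, 10.0, size=20)

# Baseline: quantize the whole matrix to 3 bits with one codebook.
codebook, idx = kmeans_quantize(W, bits=3)
W_plain = codebook[idx]

# Dense-and-sparse: keep outliers exactly, quantize only the remaining values.
dense, sparse, mask = dense_sparse_split(W, outlier_frac=0.005)
codebook_d, idx_d = kmeans_quantize(W[~mask], bits=3)
W_split = sparse.copy()
W_split[~mask] = codebook_d[idx_d]

err_plain = np.mean((W - W_plain) ** 2)
err_split = np.mean((W - W_split) ** 2)
print(f"MSE, 3-bit only:           {err_plain:.4f}")
print(f"MSE, 3-bit dense + sparse: {err_split:.4f}")
```

In this sketch the sparse part would be stored at full precision in a sparse format, and the dense part as 3-bit indices plus a small codebook. Without the split, the outliers occupy codebook entries that the bulk of the weights could otherwise use, which is the intuition behind keeping them separate.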