InternLM/lmdeploy
A toolkit for compressing, deploying, and serving large language models with GPU acceleration and quantization.

LMDeploy is an open-source toolkit for LLM compression, inference, and serving. It provides compression techniques such as 4-bit quantization (in both symmetric and asymmetric modes), optimized CUDA kernels, and the TurboMind inference engine, which builds on NVIDIA's FasterTransformer. The toolkit supports a wide range of LLMs, including the Llama, Llama2, Llama3, CodeLlama, InternLM, and Qwen families, enabling efficient deployment across hardware platforms.
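
To make the difference between the symmetric and asymmetric 4-bit modes concrete, here is a minimal pure-Python sketch of the two schemes applied to a short list of weights. It is a generic illustration of the technique, not LMDeploy's implementation: the function names, the per-list (rather than per-group) scaling, and the sample weights are all assumptions made for this example.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric mode: real 0.0 maps to integer 0; codes lie in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_symmetric(codes, scale):
    return [c * scale for c in codes]


def quantize_asymmetric(values, bits=4):
    """Asymmetric mode: codes lie in [0, qmax] and a zero point shifts the range."""
    qmax = 2 ** bits - 1                       # 15 for 4-bit
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_asymmetric(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical, one-sided weights: the case where the asymmetric mode helps most.
weights = [0.0, 0.2, 0.7, 1.5]

sym_codes, sym_scale = quantize_symmetric(weights)
asym_codes, asym_scale, asym_zero = quantize_asymmetric(weights)

sym_error = max_abs_error(weights, dequantize_symmetric(sym_codes, sym_scale))
asym_error = max_abs_error(
    weights, dequantize_asymmetric(asym_codes, asym_scale, asym_zero)
)

print("symmetric codes: ", sym_codes, "max error:", sym_error)
print("asymmetric codes:", asym_codes, "max error:", asym_error)
```

The symmetric mode stores only a scale, which keeps dequantization cheap, but for one-sided data like the sample above it leaves the negative half of the integer range unused. The asymmetric mode also stores a zero point and in exchange spreads all 16 levels over the range the values actually occupy.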