zhihu/ZhiLight
A high-performance LLM inference engine developed by Zhihu and ModelBest for accelerated model serving on GPUs.

Velocity (7d): +1.6 ★/day · Trend: steady
ZhiLight is a CUDA-based inference engine targeting PCIe GPUs that accelerates transformer-based LLMs, including Llama (and Llama 3 variants) and DeepSeek-V3/R1. It implements tensor parallelism, pipeline parallelism, and dynamic batching, together with fused kernels for attention, layer normalization, and quantization. The engine supports INT8, FP8, AWQ, and GPTQ quantization, uses custom memory management, and exposes an asynchronous OpenAI-compatible API adapted from vLLM.
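
To make the INT8 option above more concrete, here is a minimal sketch of symmetric per-row INT8 quantization in plain Python. It is not ZhiLight's code: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and a real engine would typically do this in fused CUDA kernels, possibly with a different scaling scheme. The sketch only illustrates the general idea of storing small integers plus one floating-point scale per row.

```python
from typing import List, Tuple


def quantize_int8(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization (illustrative only, not ZhiLight's kernel).

    Maps each value x to q = round(x / scale), with scale = max|x| / 127,
    so that every q fits in the signed range [-127, 127].
    """
    max_abs = max((abs(x) for x in row), default=0.0)
    if max_abs == 0.0:
        # An all-zero (or empty) row needs no scaling; use 1.0 to avoid dividing by zero.
        return [0] * len(row), 1.0
    scale = max_abs / 127.0
    # Clamp defensively in case rounding lands just outside the range.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point values from the integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weight row, chosen only for illustration.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The round-trip error per element is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The trade-off this shows is the usual one for weight quantization: each value shrinks from 16 or 32 bits to 8 bits, at the cost of a per-element error of at most `scale / 2`. Schemes such as AWQ and GPTQ choose scales and rounding more carefully to reduce the impact of that error on model quality.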