
zhihu/ZhiLight

A high-performance LLM inference engine developed by Zhihu and ModelBest for accelerated model serving on GPUs.

Velocity (7d): +1.6 ★ / day · Trend: steady

ZhiLight is a CUDA-based inference engine targeting PCIe GPUs. It accelerates transformer-based LLMs, including Llama-family models (with Llama 3 variants) and DeepSeek-V3/R1. It implements tensor parallelism, pipeline parallelism, and dynamic batching, alongside fused kernels for attention, layer norm, and quantization operations. The engine supports INT8, FP8, AWQ, and GPTQ quantization, uses custom memory management, and exposes async OpenAI-compatible APIs adapted from vLLM.
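
To make the INT8 quantization mentioned above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration of the general technique only, not ZhiLight's implementation: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and a real engine would do this in fused CUDA kernels, typically with per-channel or per-group scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative, hypothetical helper).

    Returns a list of integer codes in [-127, 127] and the scale used.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for an empty or all-zero tensor
    # to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the usual accuracy cost traded for the smaller memory footprint of 8-bit weights.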
