OpenPPL/ppl.nn
A high-performance C++ deep-learning inference engine that supports ONNX models and LLMs such as LLaMA, ChatGLM, and Baichuan.

PPLNN (Primitive Library for Neural Network) is a high-performance deep-learning inference engine written in C++. It runs ONNX models and offers strong support for the OpenMMLab ecosystem. The library includes an LLM engine with Flash Attention, split-k attention, group-query attention, dynamic batching, and tensor parallelism for distributed inference. It supports INT8 quantization, including a groupwise-quantized KV cache and per-token, per-channel quantization. Supported LLM architectures include LLaMA, ChatGLM, Baichuan, and InternLM.
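
To make the idea of groupwise INT8 quantization more concrete, here is a minimal, self-contained sketch in C++. It is not PPLNN's implementation or API: the names `GroupwiseInt8`, `QuantizeGroupwise`, and `DequantizeGroupwise` are hypothetical, and the scheme shown (symmetric quantization with one FP32 scale per fixed-size group) is an assumption chosen for illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical container (not a PPLNN type): INT8 values plus one FP32 scale
// per group of `group_size` consecutive elements.
struct GroupwiseInt8 {
    std::vector<std::int8_t> values;
    std::vector<float> scales;
    std::size_t group_size = 0;
};

// Symmetric groupwise quantization: within each group, the largest magnitude
// is mapped to 127 and every element is rounded to the nearest integer step.
GroupwiseInt8 QuantizeGroupwise(const std::vector<float>& input, std::size_t group_size) {
    if (group_size == 0) {
        throw std::invalid_argument("group_size must be positive");
    }
    GroupwiseInt8 out;
    out.group_size = group_size;
    out.values.resize(input.size());
    for (std::size_t begin = 0; begin < input.size(); begin += group_size) {
        const std::size_t end = std::min(begin + group_size, input.size());
        float max_abs = 0.0f;
        for (std::size_t i = begin; i < end; ++i) {
            max_abs = std::max(max_abs, std::fabs(input[i]));
        }
        // An all-zero group gets scale 1 to avoid dividing by zero.
        const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        out.scales.push_back(scale);
        for (std::size_t i = begin; i < end; ++i) {
            const float q = std::round(input[i] / scale);
            const float clamped = std::max(-127.0f, std::min(127.0f, q));
            out.values[i] = static_cast<std::int8_t>(clamped);
        }
    }
    return out;
}

// Reconstructs approximate FP32 values by multiplying each INT8 value by the
// scale of the group it belongs to.
std::vector<float> DequantizeGroupwise(const GroupwiseInt8& q) {
    std::vector<float> out(q.values.size());
    for (std::size_t i = 0; i < q.values.size(); ++i) {
        out[i] = static_cast<float>(q.values[i]) * q.scales[i / q.group_size];
    }
    return out;
}
```

Because each group carries its own scale, a large outlier only reduces the precision of the group it falls in rather than of the whole tensor, which is the usual motivation for quantizing per group instead of per tensor.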