OpenPPL/ppl.nn

A high-performance C++ deep-learning inference engine supporting ONNX models and LLMs such as LLaMA, ChatGLM, and Baichuan.

★ 1.4k stars · C++ · Inference · Serving · Language Models · ML Frameworks

Velocity (7d): +0.8 ★/day · Trend: steady

PPLNN (Primitive Library for Neural Network) is a high-performance deep-learning inference engine written in C++. It runs ONNX models and has strong support for the OpenMMLab ecosystem. The library includes an LLM engine with Flash Attention, split-k attention, group-query attention, dynamic batching, and tensor parallelism for distributed inference. It supports INT8 quantization, including groupwise KV cache quantization and per-token, per-channel quantization. Supported LLM architectures include LLaMA, ChatGLM, Baichuan, and InternLM.
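
To illustrate what groupwise INT8 quantization means in general, here is a minimal C++ sketch. It is not PPLNN's implementation or API: the type `GroupwiseInt8` and the functions `quantize_groupwise` and `dequantize_groupwise` are hypothetical names chosen for this example, and the symmetric max-abs scaling scheme is an assumption. The idea is that each group of consecutive values (for example, a slice of a cached key or value vector) shares one floating-point scale, so a single outlier only degrades the precision of its own group.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for this sketch; not a PPLNN type.
struct GroupwiseInt8 {
    std::vector<std::int8_t> values;  // one quantized value per input element
    std::vector<float> scales;        // one scale per group
    std::size_t group_size = 1;       // elements sharing a scale
};

// Symmetric groupwise quantization (assumed scheme):
// each group of `group_size` consecutive elements uses scale = max|x| / 127.
GroupwiseInt8 quantize_groupwise(const std::vector<float>& x, std::size_t group_size) {
    assert(group_size > 0);
    GroupwiseInt8 q;
    q.group_size = group_size;
    q.values.resize(x.size());
    for (std::size_t start = 0; start < x.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, x.size());
        float max_abs = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            max_abs = std::max(max_abs, std::fabs(x[i]));
        }
        // An all-zero group gets scale 1 to avoid dividing by zero.
        const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        q.scales.push_back(scale);
        for (std::size_t i = start; i < end; ++i) {
            float r = std::round(x[i] / scale);
            r = std::min(127.0f, std::max(-127.0f, r));  // clamp to the INT8 range used
            q.values[i] = static_cast<std::int8_t>(r);
        }
    }
    return q;
}

// Reconstruct approximate floats by multiplying each value by its group's scale.
std::vector<float> dequantize_groupwise(const GroupwiseInt8& q) {
    std::vector<float> out(q.values.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<float>(q.values[i]) * q.scales[i / q.group_size];
    }
    return out;
}
```

Calling `quantize_groupwise(x, 4)` on an 8-element vector yields two scales, and `dequantize_groupwise` recovers each element to within roughly half of its group's scale. Smaller groups cost more memory for scales but track local value ranges more closely.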