li-plus/chatglm.cpp

C++ inference engine for ChatGLM-6B, ChatGLM2-6B, ChatGLM3, GLM-4, and CodeGeeX2 models with int4/int8 quantization.

★3k stars | C++ | Inference · Serving | Language Models | ML Frameworks

Velocity · 7d: +2.7 ★ / day

Trend: → steady

This repository provides a pure C++ implementation of ChatGLM series models and GLM-4 based on the ggml library, similar in design to llama.cpp. It enables efficient CPU and GPU inference with memory optimization through int4 and int8 quantization, supports P-Tuning v2 and LoRA fine-tuned variants, and offers streaming generation. The project includes Python bindings, web demos, and API servers for deployment.
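To make the memory saving from int4 and int8 quantization concrete, here is a minimal sketch of block-wise symmetric quantization in plain Python. It is loosely in the spirit of the block formats used by ggml-based engines, but it is not the project's actual code: the block size of 32, the 2-byte per-block scale, the `[-7, 7]` integer range, and the function names `quantize_block`, `dequantize_block`, and `bytes_per_block` are all assumptions made for illustration.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed number of weights sharing one scale (illustrative)


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Map one block of floats to signed 4-bit integers in [-7, 7] plus a shared scale."""
    amax = max(abs(w) for w in weights)
    # An all-zero block gets a dummy scale of 1.0 to avoid dividing by zero.
    scale = amax / 7.0 if amax > 0 else 1.0
    quants = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate floats; each value is off by at most half a scale step."""
    return [q * scale for q in quants]


def bytes_per_block(bits: int, block_size: int = BLOCK_SIZE, scale_bytes: int = 2) -> float:
    """Storage for one block: packed integers plus one scale (assumed 2 bytes)."""
    return block_size * bits / 8 + scale_bytes


if __name__ == "__main__":
    block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
    print(f"bytes per block: int4={bytes_per_block(4)}, int8={bytes_per_block(8)}, fp16={BLOCK_SIZE * 2}")
```

Under these assumptions, a block of 32 weights takes 18 bytes at int4 and 34 bytes at int8, compared with 64 bytes at fp16. Scaled to a model with about 6 billion parameters, that is roughly 3.4 GB of int4 weights versus 12 GB at fp16, ignoring any tensors left unquantized and runtime buffers.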