mit-han-lab/TinyChatEngine
On-device LLM/VLM inference engine written in C++ with SmoothQuant and AWQ quantization support for x86, ARM, and CUDA.

Velocity (7d): +0.9 ★/day · Trend: steady
TinyChatEngine is a from-scratch C++ implementation for running compressed LLMs and VLMs on edge devices such as laptops, cars, and robots. It compresses models with the SmoothQuant and AWQ quantization techniques and supports Intel/AMD x86, Apple M1/M2 (ARM), and NVIDIA CUDA platforms. Because inference runs on the device in real time, user data stays local, which improves privacy.
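To make the idea of low-bit weight compression concrete, here is a minimal sketch of generic group-wise 4-bit quantization in C++. It is not TinyChatEngine's API and it is not AWQ or SmoothQuant themselves (AWQ, for instance, additionally rescales salient weight channels using activation statistics before quantizing). The names `QuantGroup`, `quantize_group_4bit`, and `dequantize_group` are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for one quantized group of weights
// (illustrative only; not a TinyChatEngine type).
struct QuantGroup {
    float scale;                 // step size between adjacent 4-bit codes
    float min;                   // real value that code 0 maps back to
    std::vector<uint8_t> codes;  // one code in [0, 15] per weight, stored unpacked
};

// Map each float weight in the group to one of 16 evenly spaced levels
// spanning [min, max] of that group.
QuantGroup quantize_group_4bit(const std::vector<float>& w) {
    assert(!w.empty());
    const auto [lo_it, hi_it] = std::minmax_element(w.begin(), w.end());
    const float lo = *lo_it;
    const float hi = *hi_it;

    QuantGroup g;
    g.min = lo;
    // Guard against a constant group, where hi == lo would give a zero scale.
    g.scale = (hi > lo) ? (hi - lo) / 15.0f : 1.0f;
    g.codes.reserve(w.size());
    for (float x : w) {
        const long q = std::lround((x - lo) / g.scale);
        g.codes.push_back(static_cast<uint8_t>(std::clamp(q, 0L, 15L)));
    }
    return g;
}

// Reconstruct approximate float weights from the 4-bit codes.
std::vector<float> dequantize_group(const QuantGroup& g) {
    std::vector<float> out;
    out.reserve(g.codes.size());
    for (uint8_t c : g.codes) {
        out.push_back(g.min + static_cast<float>(c) * g.scale);
    }
    return out;
}
```

Each weight is reconstructed to within about half a quantization step of its original value. A real engine would also pack two 4-bit codes into each byte and fuse dequantization into the matrix-multiply kernel; both are omitted here for clarity.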