turboderp/exllama
A standalone Llama implementation optimized for running 4-bit quantized models on modern NVIDIA GPUs.

Velocity (7d): +2.6 ★/day · Trend: steady
ExLlama is a memory-efficient rewrite of Hugging Face Transformers’ Llama implementation, designed specifically for GPTQ-quantized weights. It combines Python, C++, and CUDA code to achieve fast inference with a reduced memory footprint. The project includes a web UI for interacting with quantized Llama models and targets modern NVIDIA GPUs (RTX 30-series and later).
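
To give a rough sense of why 4-bit weights matter for memory, the following back-of-the-envelope sketch compares the storage needed for model weights at 16 bits and at 4 bits per weight. It is not taken from ExLlama's code: the `weight_bytes` helper and the 7-billion-parameter count are assumptions chosen for illustration, and the estimate ignores GPTQ's per-group scales and zero points, the key/value cache, and activations.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given bit width.

    Hypothetical helper for illustration only; it counts raw weight
    storage and nothing else.
    """
    return n_params * bits_per_weight / 8


# Assumed parameter count for a 7B-class model (illustrative, not measured).
n_params = 7_000_000_000

fp16_gib = weight_bytes(n_params, 16) / GIB
int4_gib = weight_bytes(n_params, 4) / GIB

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
# prints "fp16: 13.0 GiB, 4-bit: 3.3 GiB"
```

Under these assumptions the raw weights shrink by a factor of four. Real quantized checkpoints are somewhat larger than this estimate because of the quantization metadata stored alongside the packed weights.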