turboderp/exllama

A standalone Llama implementation optimized for running 4-bit quantized models on modern NVIDIA GPUs.

Velocity (7d): +2.6 ★/day · Trend: steady

ExLlama is a memory-efficient rewrite of Hugging Face Transformers’ Llama implementation, designed specifically for GPTQ-quantized weights. It combines Python, C++, and CUDA code to achieve fast inference with a reduced memory footprint. The project includes a web UI for interacting with quantized Llama models and targets modern NVIDIA GPUs (RTX 30-series and later).
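
To give a rough sense of why 4-bit weights reduce the memory footprint, the sketch below compares weight storage at 16 and 4 bits per parameter. It is back-of-the-envelope arithmetic only, not ExLlama code: the helper `weight_memory_gib` and the 7-billion parameter count are illustrative assumptions, and the estimate ignores quantization metadata (scales and zero points), the key/value cache, and activations.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Illustrative parameter count (an assumption, not a figure from the project).
n_params = 7e9

fp16 = weight_memory_gib(n_params, 16)  # half-precision baseline
int4 = weight_memory_gib(n_params, 4)   # 4-bit quantized weights

print(f"FP16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB, ratio: {fp16 / int4:.0f}x")
```

Under these assumptions the weights shrink by a factor of four; real GPTQ checkpoints are somewhat larger than this estimate because of the per-group metadata they carry.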
