turboderp-org/exllamav2

A Python inference library for efficiently running quantized large language models on consumer GPUs.

4.5k stars · Python · Inference · Serving
Velocity (7d): +4.5 ★ / day · Trend: steady

ExLlamaV2 is a high-performance inference engine for running large language models locally on modern consumer-grade GPUs. It features paged attention (via Flash Attention 2.5.7+), dynamic batching with smart prompt caching, K/V cache deduplication, and speculative decoding. The library offers a simplified API for both single and batched generation, with async streaming support.
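
To make the idea of prompt caching with K/V cache deduplication more concrete, here is a minimal, library-free sketch of how a paged cache could recognize shared prompt prefixes. It is an illustration under stated assumptions, not ExLlamaV2's actual implementation: the names `PAGE_SIZE`, `page_keys`, and `PagePool` are hypothetical, the page size is deliberately tiny, and no real K/V tensors are stored.

```python
import hashlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4  # tokens per page; tiny, hypothetical value chosen for illustration


def page_keys(token_ids: List[int], page_size: int = PAGE_SIZE) -> List[str]:
    """Return one chained hash per full page of the prompt.

    Each key depends on the page's tokens and on the previous key, so two
    pages match only if the entire prefix up to that point is identical.
    """
    keys: List[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % page_size  # ignore the partial tail
    for start in range(0, full, page_size):
        page = token_ids[start:start + page_size]
        data = prev + "|" + ",".join(map(str, page))
        prev = hashlib.sha256(data.encode()).hexdigest()
        keys.append(prev)
    return keys


class PagePool:
    """Toy pool that tracks each distinct page once and counts references."""

    def __init__(self) -> None:
        self.refcount: Dict[str, int] = {}

    def admit(self, token_ids: List[int]) -> Tuple[int, int]:
        """Register a prompt; return (pages_reused, pages_allocated)."""
        reused = allocated = 0
        for key in page_keys(token_ids):
            if key in self.refcount:
                reused += 1      # an identical prefix page is already cached
            else:
                allocated += 1   # a new page would have to be computed
            self.refcount[key] = self.refcount.get(key, 0) + 1
        return reused, allocated


pool = PagePool()
prompt_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # two full pages plus one leftover token
prompt_b = [1, 2, 3, 4, 9, 9, 9, 9, 5]   # shares only the first page with prompt_a
first = pool.admit(prompt_a)    # (0, 2): nothing cached yet
second = pool.admit(prompt_b)   # (1, 1): first page reused, second page is new
```

In this sketch the second prompt only needs one new page, because its first page hashes to a key the pool already holds. A real engine would attach K/V tensors to each page and evict pages whose reference count drops to zero.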
