turboderp-org/exllamav2
A Python inference library for efficiently running quantized large language models on consumer GPUs.

Velocity (7d): +4.5 ★/day
Trend: → steady
ExLlamaV2 is a high-performance inference engine for running large language models locally on modern consumer-grade GPUs. It features paged attention via Flash Attention 2.5.7+, dynamic batching with smart prompt caching, K/V cache deduplication, and speculative decoding. The library offers a simplified API for both single and batched generation, with async streaming support.
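
To make the idea of prompt caching and K/V cache deduplication more concrete, here is a minimal conceptual sketch in plain Python. It is not ExLlamaV2's implementation and uses none of its API: the names `page_keys` and `ToyPageCache`, the page size of 4 tokens, and the chained-hash keying scheme are all assumptions chosen for illustration. The sketch only shows how sequences that share a prefix could reuse already-computed cache pages instead of recomputing them.

```python
import hashlib
from typing import Dict, List, Tuple

PAGE_SIZE = 4  # hypothetical toy value; real paged caches use much larger pages


def page_keys(token_ids: List[int], page_size: int = PAGE_SIZE) -> List[str]:
    """Return one chained hash per full page of the token sequence.

    Each key depends on the page's tokens and on the previous page's key,
    so two pages match only if the entire prefix up to that page matches.
    """
    keys: List[str] = []
    prev = b""
    full = len(token_ids) - len(token_ids) % page_size  # ignore the partial tail page
    for start in range(0, full, page_size):
        page = token_ids[start:start + page_size]
        payload = prev + ",".join(map(str, page)).encode()
        digest = hashlib.sha256(payload).hexdigest()
        keys.append(digest)
        prev = digest.encode()
    return keys


class ToyPageCache:
    """Tracks which pages exist and how many sequences reference each one."""

    def __init__(self) -> None:
        self.pages: Dict[str, int] = {}  # page key -> reference count

    def admit(self, token_ids: List[int]) -> Tuple[int, int]:
        """Register a sequence; return (pages reused, pages newly computed)."""
        reused = computed = 0
        for key in page_keys(token_ids):
            if key in self.pages:
                self.pages[key] += 1  # shared prefix page: no recomputation needed
                reused += 1
            else:
                self.pages[key] = 1   # new page: would be computed by the model
                computed += 1
        return reused, computed


cache = ToyPageCache()
first = list(range(10))                        # two full pages plus a partial tail
second = list(range(8)) + [99, 100, 101, 102]  # same first two pages, then a new one
print(cache.admit(first))
print(cache.admit(second))
```

In this toy model, the second sequence reuses both pages of its shared prefix and only its final page counts as new work. Chaining each key to the previous page's key is one simple way to ensure that a page is shared only when everything before it is identical.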