Is exllamav2 open source?

Yes. turboderp-org/exllamav2 is open source, released under the MIT license.

What language is exllamav2 written in?

turboderp-org/exllamav2 is primarily written in Python.

How popular is exllamav2?

turboderp-org/exllamav2 has 4.6k stars on GitHub.

Where can I find exllamav2?

turboderp-org/exllamav2 is on GitHub at https://github.com/turboderp-org/exllamav2.

turboderp-org/exllamav2

Archived: the inference engine that squeezed 70B models into 24 GB

An inference library built to make multi-billion-parameter LLMs runnable on a single consumer GPU through aggressive, per-layer mixed quantization.

★ 4.6k stars · Python · Inference · Serving

What it does

ExLlamaV2 is a Python inference engine for running LLMs locally on modern consumer NVIDIA GPUs. It loads standard GPTQ models as well as its own EXL2 format, and its Python API covers single-prompt, batched, and streaming generation, including asyncio-based use. The repository is now archived; development continues in ExLlamaV3.

The interesting bit

The core idea is EXL2 quantization, which can mix 2-, 3-, 4-, 5-, 6-, and 8-bit weights within the same linear layer, giving more bits to the more important columns while still hitting a target average bitrate. By the author's report, this fits a 70B model on a single 24 GB GPU at 2.55 bits per weight. A newer dynamic generator adds paged attention via Flash Attention, dynamic batching, prompt caching, and K/V cache deduplication behind a consolidated API.
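
The 2.55 bits-per-weight figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the helper names and the 60/25/15 split across bit widths are hypothetical and not taken from the EXL2 quantizer, and the estimate counts weights alone, ignoring the K/V cache, activations, and any storage overhead not captured in the average bitrate.

```python
def average_bpw(allocation):
    """Weighted mean bit width; allocation maps bit width -> share of weights."""
    total_share = sum(allocation.values())
    return sum(bits * share for bits, share in allocation.items()) / total_share


def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for the quantized weights alone, in bytes."""
    return n_params * bits_per_weight / 8


# Hypothetical mix: mostly 2-bit weights, with 3- and 4-bit widths
# reserved for the columns judged more important.
mix = {2: 0.60, 3: 0.25, 4: 0.15}
bpw = average_bpw(mix)            # about 2.55 bits per weight
size = weight_bytes(70e9, bpw)    # about 22.3e9 bytes for 70B parameters

print(f"average bitrate: {bpw:.2f} bits/weight")
print(f"70B weights: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Under these assumptions the weights alone come to roughly 22 GB, which leaves only a few gigabytes on a 24 GB card for the cache and activations.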

Key highlights

Supports both GPTQ and its own EXL2 format, which mixes 2–8 bit quantization within individual linear layers to hit a target average bitrate
Dynamic generator consolidates inference behind one API, adding paged attention via Flash Attention, dynamic batching, prompt caching, and K/V cache deduplication
Author-reported benchmarks on an RTX 4090 reach 770 tokens/s for a 1.1B TinyLlama and 38 tokens/s for a 70B Llama 2 model, though the README warns that slow CPUs can bottleneck performance
Python API handles single prompts, batched generation, and streamed asyncio output
Ecosystem integrations include TabbyAPI (OpenAI-compatible server), ExUI, text-generation-webui, and lollms-webui
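
Prompt caching and K/V cache deduplication can both be read as reusing cache pages whose contents are already known. The toy model below sketches that idea under stated assumptions and is not the library's implementation: `PagedPrefixCache`, its `allocate` method, and the tiny page size are all hypothetical. Each full page of token IDs is keyed by a hash chained to the previous page's hash, because a token's keys and values depend on everything that precedes it.

```python
import hashlib


class PagedPrefixCache:
    """Toy page table: identical prefixes map to the same page IDs."""

    def __init__(self, page_size=4):
        self.page_size = page_size
        self.pages = {}  # chained hash of page contents -> page ID

    def allocate(self, token_ids):
        """Return (page_ids, reused_count) for the full pages of a sequence."""
        page_ids = []
        reused = 0
        prev_hash = b""
        # Only complete pages are shared; a trailing partial page is skipped.
        full_len = len(token_ids) - len(token_ids) % self.page_size
        for start in range(0, full_len, self.page_size):
            chunk = token_ids[start:start + self.page_size]
            # Chain the hash so a page only matches if its whole prefix matches.
            key = hashlib.sha256(prev_hash + repr(chunk).encode()).digest()
            if key in self.pages:
                reused += 1
            else:
                self.pages[key] = len(self.pages)
            page_ids.append(self.pages[key])
            prev_hash = key
        return page_ids, reused


cache = PagedPrefixCache(page_size=4)
shared_prefix = [11, 12, 13, 14, 15, 16, 17, 18]  # two full pages
print(cache.allocate(shared_prefix + [21, 22, 23, 24]))  # all pages are new
print(cache.allocate(shared_prefix + [31, 32, 33, 34]))  # prefix pages reused
```

The second request reuses the two pages covering the shared prefix and allocates only one new page, which is the saving that deduplication aims for when many requests share a system prompt.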

Caveats

The repository is explicitly archived; active development has moved to ExLlamaV3
Quantizing large models to EXL2 is described as “somewhat slow”
The README warns that a slow CPU can still bottleneck GPU inference speeds

Verdict

Study it if you are researching aggressive quantization strategies or maintaining existing EXL2 pipelines, but start new work on the actively developed ExLlamaV3 instead.
