Is mixtral-offloading open source?

Yes — dvmazur/mixtral-offloading is open source, released under the MIT license.

What language is mixtral-offloading written in?

dvmazur/mixtral-offloading is primarily written in Python.

How popular is mixtral-offloading?

dvmazur/mixtral-offloading has 2.3k stars on GitHub.

Where can I find mixtral-offloading?

dvmazur/mixtral-offloading is on GitHub at https://github.com/dvmazur/mixtral-offloading.

dvmazur/mixtral-offloading

Mixtral-8x7B on a budget: offloading experts to CPU RAM

It makes Mixtral-8x7B runnable on consumer hardware by quantizing attention and experts separately, then shuttling individual experts between CPU and GPU memory.

★ 2.3k stars · Python · Inference · Serving · Language Models

What it does

This project implements efficient inference for Mixtral-8x7B models on memory-constrained hardware such as Google Colab or consumer desktops. It fits the model into combined GPU and CPU memory by applying separate HQQ quantization schemes to the attention layers and to the mixture-of-experts layers. Each expert is offloaded individually to CPU RAM and copied back to the GPU only when it is needed to compute activations; an LRU cache keeps recently used experts on the GPU, which reduces transfers between GPU memory and CPU RAM when adjacent tokens use the same experts.
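The offload-and-cache loop described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the repository's code: the class name ExpertLRUCache, the cpu_store dictionary, and the string "weights" are all hypothetical stand-ins, and no real GPU transfer takes place.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Toy model of per-expert offloading with an LRU cache.

    `cpu_store` stands in for CPU RAM and `self.gpu` for GPU memory.
    All names here are hypothetical, not the repository's API.
    """

    def __init__(self, cpu_store, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.cpu_store = cpu_store   # (layer, expert) -> weights
        self.capacity = capacity     # max experts resident on the "GPU"
        self.gpu = OrderedDict()     # ordered least- to most-recently used
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.gpu:
            # Hit: mark as most recently used; no transfer is needed.
            self.gpu.move_to_end(key)
            self.hits += 1
            return self.gpu[key]
        # Miss: evict the least recently used expert if the cache is full,
        # then "copy" the requested expert over from the CPU store.
        self.misses += 1
        if len(self.gpu) >= self.capacity:
            self.gpu.popitem(last=False)
        self.gpu[key] = self.cpu_store[key]
        return self.gpu[key]


# Usage: one layer with 8 experts, room for 2 of them on the "GPU".
store = {(0, e): f"weights-{e}" for e in range(8)}
cache = ExpertLRUCache(store, capacity=2)
for expert in [3, 5, 3, 5, 1, 3]:
    cache.get(0, expert)
print(cache.hits, cache.misses)
```

Evicted experts are simply dropped here because inference never modifies the weights, so the CPU copy stays valid. A real implementation would also have to manage device buffers and transfer timing, which this sketch ignores.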

The interesting bit

The trick is the granularity. Rather than shuffling entire layers or the whole model, the code treats each expert as an independently swappable unit. The LRU cache is a bet on temporal locality: if the next token needs the same expert, you skip another slow trip across the PCIe bus.
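A back-of-envelope calculation helps show why the granularity matters. The sketch below compares the data moved when fetching a whole layer's experts with fetching only the experts a token uses. The layer dimensions and top-2 routing come from the published Mixtral-8x7B configuration, not from this page. The 3-bit width is purely an illustrative assumption, and the estimate ignores quantization metadata such as scales and zero points.

```python
# Shape values from the published Mixtral-8x7B configuration.
HIDDEN = 4096            # model hidden size
INTERMEDIATE = 14336     # feed-forward width inside each expert
EXPERTS_PER_LAYER = 8
EXPERTS_PER_TOKEN = 2    # top-2 routing

# Assumption for illustration only; not a setting taken from the repository.
ASSUMED_BITS = 3


def expert_params():
    # Each expert holds three HIDDEN x INTERMEDIATE weight matrices.
    return 3 * HIDDEN * INTERMEDIATE


def megabytes(params, bits):
    # Raw weight storage only, in units of 10**6 bytes.
    return params * bits / 8 / 1e6


per_expert = megabytes(expert_params(), ASSUMED_BITS)
whole_layer = per_expert * EXPERTS_PER_LAYER
worst_case_fetch = per_expert * EXPERTS_PER_TOKEN

print(f"one expert:        {per_expert:.1f} MB")
print(f"all layer experts: {whole_layer:.1f} MB")
print(f"top-2 worst case:  {worst_case_fetch:.1f} MB")
```

Under these assumptions, a token that misses the cache on both of its experts still moves only a quarter of what a whole-layer swap would, and every cache hit lowers that further.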

Key highlights

Mixed HQQ quantization with separate schemes for attention layers and MoE experts.
Expert-level offloading: each layer’s experts reside in CPU RAM and are fetched to GPU individually on demand.
LRU cache for active experts to exploit reuse across adjacent tokens.
Ships as a demo notebook targeting Colab; no standalone CLI script is provided yet.
Methods and results are detailed in an accompanying arXiv tech report.

Caveats

No command-line interface exists yet; running locally requires adapting the provided Jupyter notebook.
Several techniques described in the tech report have not yet landed in the repository.

Verdict

Worth a look if you want to experiment with Mixtral-8x7B but only have a single consumer GPU or a Colab tab. Pass if you need a production-ready CLI tool today, or the complete set of techniques described in the tech report.
