dvmazur/mixtral-offloading
Runs the Mixtral-8x7B mixture-of-experts LLM efficiently on resource-constrained hardware via quantization and CPU-GPU offloading.

Velocity (7d): +2.6 ★/day · Trend: steady
This project implements techniques for running the Mixtral-8x7B mixture-of-experts language model on Google Colab or consumer desktops, where the model would otherwise not fit in memory. It uses mixed quantization with HQQ, applying separate quantization schemes to different layer types, and an MoE offloading strategy: experts are kept in CPU RAM and moved to the GPU only when needed, with an LRU cache of recently used experts to reduce transfers between GPU memory and CPU RAM.
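The LRU idea described above can be illustrated with a minimal sketch. This is not the repository's code: the class name `ExpertLRUCache`, the `load_to_gpu` callback, and the hit/miss counters are hypothetical names chosen for illustration, and the "transfer" is simulated with plain Python objects rather than real GPU tensors.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident; others must be reloaded on demand."""

    def __init__(self, capacity: int, load_to_gpu: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical callback: stands in for copying one expert from CPU RAM to the GPU.
        self.load_to_gpu = load_to_gpu
        # Insertion order doubles as recency order: oldest first, newest last.
        self._resident: "OrderedDict[Hashable, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        if expert_id in self._resident:
            # Cache hit: mark as most recently used, no transfer needed.
            self._resident.move_to_end(expert_id)
            self.hits += 1
            return self._resident[expert_id]

        # Cache miss: evict the least recently used expert if the cache is full.
        self.misses += 1
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)
        expert = self.load_to_gpu(expert_id)
        self._resident[expert_id] = expert
        return expert

    def resident_ids(self) -> List[Hashable]:
        """Expert ids currently resident, least recently used first."""
        return list(self._resident)


if __name__ == "__main__":
    # Simulated routing decisions for six consecutive tokens.
    cache = ExpertLRUCache(capacity=2, load_to_gpu=lambda eid: f"expert-{eid}")
    for eid in [0, 1, 0, 2, 0, 1]:
        cache.get(eid)
    print(cache.hits, cache.misses, cache.resident_ids())
```

Each hit avoids one CPU-to-GPU copy. The cache therefore helps most when adjacent tokens are routed to overlapping experts, and a larger `capacity` trades GPU memory for fewer transfers.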