
dvmazur/mixtral-offloading

Runs the Mixtral-8x7B mixture-of-experts LLM efficiently on resource-constrained hardware via quantization and CPU-GPU offloading.


This project implements techniques for running the Mixtral-8x7B mixture-of-experts language model on Google Colab or consumer desktops, where the model would otherwise not fit in GPU memory. It uses mixed quantization with HQQ, applying a separate quantization scheme to each layer type, and an MoE offloading strategy in which every expert is kept in CPU RAM and copied to the GPU only when it is needed. An LRU cache of recently used experts on the GPU minimizes transfers between CPU RAM and GPU memory.
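The LRU caching idea can be illustrated with a small, framework-free sketch. This is not the repository's code: the class `ExpertLRUCache`, the `load_fn` callback, and the string "weights" are all hypothetical stand-ins, and a real implementation would copy quantized tensors from CPU RAM to the GPU instead of returning plain Python objects.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Toy LRU cache mapping expert id -> weights resident "on the GPU".

    Hypothetical illustration only: load_fn stands in for a CPU-to-GPU copy.
    """

    def __init__(self, capacity, load_fn):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn      # fetches an expert from CPU RAM
        self._gpu = OrderedDict()   # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self._gpu:
            # Cache hit: mark the expert as most recently used.
            self._gpu.move_to_end(expert_id)
            self.hits += 1
            return self._gpu[expert_id]
        # Cache miss: evict the least recently used expert if the cache is full.
        self.misses += 1
        if len(self._gpu) >= self.capacity:
            self._gpu.popitem(last=False)
        weights = self.load_fn(expert_id)
        self._gpu[expert_id] = weights
        return weights

    def resident(self):
        """Return cached expert ids, least recently used first."""
        return list(self._gpu)


# Eight experts held in "CPU RAM"; only two fit "on the GPU" at once.
cpu_experts = {i: f"weights-{i}" for i in range(8)}
cache = ExpertLRUCache(capacity=2, load_fn=cpu_experts.__getitem__)
for expert_id in [0, 1, 0, 2, 0, 3]:
    cache.get(expert_id)
print(cache.hits, cache.misses, cache.resident())
```

In this toy run, expert 0 is requested repeatedly and stays cached, so only the less frequently used experts are evicted and reloaded. The more the router's choices repeat between consecutive tokens, the fewer transfers such a cache needs.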
