
lyogavin/airllm

AirLLM enables inference with 70B+ parameter LLMs on a single 4GB GPU through memory optimization, without quantization, distillation, or pruning.

Velocity (7d): +18 ★/day · Trend: steady

The library optimizes inference memory usage for large language models, so that even very large models such as Llama 3.1 405B can run within the limited VRAM of a consumer GPU. It achieves this through memory optimizations such as layer-by-layer model loading and attention optimization, rather than through quantization, distillation, or pruning. The project includes Jupyter Notebook examples and supports various model configurations.
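As a rough illustration of why a small GPU can be enough, the arithmetic below compares holding all weights at once with holding a single transformer layer, assuming weights are streamed through the GPU one layer at a time rather than kept resident. The 80-layer count and 16-bit weights are illustrative assumptions, not figures from the project, and the estimate ignores embeddings, activations, and the KV cache.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold `num_params` weights, in GiB (16-bit by default)."""
    return num_params * bytes_per_param / 2**30


def per_layer_gib(num_params: float, num_layers: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for one layer, assuming weights are spread evenly across layers."""
    return weights_gib(num_params, bytes_per_param) / num_layers


# Hypothetical 70B-parameter model with 80 transformer layers (assumed values).
full_model = weights_gib(70e9)
one_layer = per_layer_gib(70e9, num_layers=80)

print(f"all weights resident: {full_model:.1f} GiB")
print(f"one layer resident:   {one_layer:.2f} GiB")
print(f"fits in 4 GiB at once: {full_model <= 4}, one layer at a time: {one_layer <= 4}")
```

Under these assumptions the full set of weights is far larger than 4 GiB, while a single layer is well under it. The trade-off is speed: every layer has to be reloaded on each forward pass.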
