lyogavin/airllm
AirLLM enables inference with 70B+ parameter LLMs on a single 4GB GPU through memory optimization, without quantization, distillation, or pruning.

Velocity (7d): +18 ★/day · Trend: steady
The library optimizes inference memory usage for large language models, allowing very large models such as Llama 3.1 405B to run on consumer GPUs with limited VRAM. It does this mainly by loading and executing the model one layer at a time (layered inference), so only a small slice of the weights sits in GPU memory at any moment, rather than by quantization, distillation, or pruning. The project includes Jupyter Notebook examples and supports various model configurations.
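One general way to get this kind of memory saving without quantization is to keep only one layer's weights in memory at a time and stream the rest from disk. The toy sketch below illustrates that idea using only the Python standard library. It is not AirLLM's code or API: the function names (`save_layers`, `apply_layer`, `layered_forward`, `demo`), the pickle file layout, and the tiny 2x2 "layers" are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weights to its own file, mimicking a per-layer checkpoint."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.pkl"  # hypothetical file layout
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def layered_forward(layer_paths, x):
    """Run a forward pass while holding only one layer's weights at a time."""
    for path in layer_paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # load this layer only
        x = apply_layer(weights, x)   # activations are the only state carried forward
        del weights                   # release the layer before loading the next one
    return x


def demo():
    layers = [
        [[1.0, 0.0], [0.0, 1.0]],   # identity
        [[2.0, 0.0], [0.0, -1.0]],  # scale first input, negate second (ReLU then zeroes it)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_layers(layers, tmp)
        return layered_forward(paths, [3.0, 4.0])


if __name__ == "__main__":
    print(demo())
```

The trade-off in this pattern is that peak memory scales with the largest single layer rather than the whole model, at the cost of re-reading weights from disk on every forward pass, so throughput is bounded by storage bandwidth.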