Run a 160GB model on an 8GB GPU—no quantization required
oLLM streams weights and KV cache from SSD to GPU layer-by-layer, keeping full fp16/bf16 precision while fitting massive contexts into consumer hardware.

What it does
oLLM is a Python inference library for running large-context LLMs on modest GPUs by aggressively offloading to SSD and CPU. It loads layer weights from disk directly into GPU memory one at a time, shunts the KV cache to SSD, and optionally parks some layers on CPU. There is no quantization and no 4-bit tricks, just orchestrated memory juggling on top of Hugging Face Transformers and PyTorch.
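To make the "one layer at a time" idea concrete, here is a minimal sketch in plain Python. It is not oLLM's code or API: the JSON files stand in for weight shards on SSD, the function names (`save_layers`, `apply_layer`, `streamed_forward`) are hypothetical, and the toy "layer" is just a matrix-vector product with ReLU. The point is only that the forward pass never needs more than one layer's weights resident at once.

```python
import json
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weights to its own file, standing in for shards on SSD."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy 'layer': a matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def streamed_forward(layer_paths, x):
    """Run the forward pass holding only one layer's weights in memory at a time."""
    for path in layer_paths:
        weights = json.loads(path.read_text())  # load this layer from disk
        x = apply_layer(weights, x)             # compute with it
        del weights                             # drop it before loading the next
    return x


# Hypothetical 3-layer toy model with 2x2 weight matrices.
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.5]],
    [[1.0, 1.0], [1.0, -1.0]],
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(toy_layers, tmp)
    streamed_out = streamed_forward(paths, [1.0, 2.0])

# Reference: the same computation with every layer resident at once.
resident_out = [1.0, 2.0]
for w in toy_layers:
    resident_out = apply_layer(w, resident_out)

print(streamed_out, resident_out)
```

Both paths produce the same output. The streamed version pays for it with a disk read per layer per forward pass, which is where the throughput cost comes from.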
The interesting bit
The trade-off is explicit and almost retro: you need plenty of fast SSD space (180 GB for qwen3-next-80B, 69 GB for Llama-3.1-8B at 100k context), but in return you keep full precision and avoid the accuracy compromises of quantization. The library also chunks MLP layers and uses FlashAttention-2 with online softmax so the full attention matrix never materializes.
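The online-softmax trick is easier to trust once you see it next to the naive version. The sketch below is an illustration in plain Python, not FlashAttention-2 or oLLM code. It handles a single query with scalar values, and the function names and `chunk_size` parameter are made up for the example. It keeps a running maximum, a running normalizer, and a running weighted sum, rescaling the accumulators whenever a new chunk raises the maximum.

```python
import math


def naive_attention(scores, values):
    """Reference: materialize every softmax weight, then take the weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def online_attention(scores, values, chunk_size=2):
    """Same result, computed chunk by chunk with running statistics."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), chunk_size):
        s_chunk = scores[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        new_max = max(running_max, max(s_chunk))
        # Rescale the old accumulators so they are relative to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention(scores, values), online_attention(scores, values))
```

Only one chunk of scores is live at any moment, so memory scales with the chunk size rather than with the full context length.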
Key highlights
- Fits a 160 GB qwen3-next-80B model into ~7.5 GB VRAM with 50k context (throughput: roughly 1 token per 2 seconds)
- Supports Llama 3, Gemma 3, GPT-OSS-20B, Qwen3-Next, and multimodal models (Gemma 3 vision, Voxtral audio)
- AutoInference class supports any Llama 3 / Gemma 3 model, with PEFT adapter support
- Optional kvikio for faster NVIDIA SSD→GPU transfers; works on AMD and Apple Silicon too
- No mandatory compiled extensions: flash-attn and kvikio are optional
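The headline numbers are easy to sanity-check with back-of-envelope arithmetic. The snippet below assumes the parameter count is the 80 billion implied by the model name and that weights are stored at 2 bytes each (true of both fp16 and bf16). The helper names are made up for this example.

```python
BYTES_PER_PARAM = 2  # fp16 and bf16 both use 2 bytes per parameter


def weight_footprint_gb(num_params):
    """Size of the raw weights in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM / 1e9


def generation_minutes(num_tokens, seconds_per_token=2.0):
    """Wall-clock time to generate num_tokens at the quoted ~1 token per 2 seconds."""
    return num_tokens * seconds_per_token / 60


qwen_gb = weight_footprint_gb(80e9)    # assumed 80B parameters
minutes_500 = generation_minutes(500)  # a 500-token answer

print(f"{qwen_gb:.0f} GB of weights, {minutes_500:.1f} minutes for 500 tokens")
```

That works out to 160 GB of weights and roughly 17 minutes for a 500-token response, which is why this reads as a batch tool rather than a chat backend.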
Caveats
- Throughput is modest; this is for offline batch work, not chatty interactive use
- Requires significant SSD space and presumably fast storage to avoid bottlenecking
- Installation needs --no-build-isolation, suggesting some non-trivial native compilation
Verdict
Worth a look if you need to process long documents locally and would rather trade speed for precision than quantize your model. If you need real-time responses or lack a roomy NVMe drive, this is not your tool.