Mega4alik/ollm

Run a 160GB model on an 8GB GPU—no quantization required

oLLM streams weights and KV cache from SSD to GPU layer-by-layer, keeping full fp16/bf16 precision while fitting massive contexts into consumer hardware.

Velocity (7d): +9.0 ★/day · Trend: steady

What it does

oLLM is a Python inference library for running large-context LLMs on modest GPUs by aggressively offloading to SSD and CPU. It loads layer weights from disk directly into GPU memory one layer at a time, moves the KV cache to SSD, and optionally keeps some layers on the CPU. There is no quantization and no 4-bit trickery, just orchestrated memory juggling on top of Hugging Face Transformers and PyTorch.
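The layer-at-a-time idea can be sketched in a few lines. This is a toy illustration in plain Python, not oLLM's code: the helper names (`save_layers`, `streamed_forward`) are hypothetical, the "layers" are tiny matrices, and files on disk stand in for the SSD-to-GPU path. The point is only that peak memory is bounded by one layer rather than the whole model.

```python
import pickle
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weight matrix to its own file and return the paths."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def streamed_forward(layer_paths, x):
    """Run x through every layer while holding only one layer in memory."""
    for path in layer_paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # load this layer from disk
        x = matvec(weights, x)        # compute with it
        del weights                   # release it before loading the next one
    return x


# Toy model: three 2x2 "layers".
layers = [
    [[1.0, 0.0], [0.0, 1.0]],  # identity
    [[2.0, 0.0], [0.0, 2.0]],  # scale by 2
    [[0.0, 1.0], [1.0, 0.0]],  # swap the two components
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    result = streamed_forward(paths, [3.0, 4.0])

print(result)  # [8.0, 6.0]
```

The cost of this pattern is that every forward pass re-reads the weights from storage, which is why disk speed dominates throughput.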

The interesting bit

The trade-off is explicit and almost retro: you need plenty of fast SSD space (180 GB for qwen3-next-80B, 69 GB for Llama-3.1-8B at 100k context), but in return you keep full precision and avoid the accuracy compromises of quantization. The library also chunks MLP layers and uses FlashAttention-2 with online softmax, so the full attention matrix is never materialized.
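Online softmax is what lets attention run block by block. The sketch below is a plain-Python illustration of the general technique for a single query vector, not FlashAttention-2's kernel or oLLM's code; the function names and the `block_size` parameter are made up for the example. It keeps a running maximum, a running normalizer, and a running weighted sum, rescaling them whenever a new block raises the maximum.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference version: compute every score, then one softmax over all of them."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * inv_sqrt_d for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def online_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are consumed one block at a time."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) * inv_sqrt_d for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (the factor is 0.0 on the first block, when nothing is accumulated).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

full = naive_attention(q, keys, values)
blocked = online_attention(q, keys, values, block_size=2)
print(full, blocked)
```

Only one block of scores exists at any time, so memory scales with the block size rather than with the context length.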

Key highlights

  • Fits a 160 GB qwen3-next-80B model into ~7.5 GB VRAM with 50k context (throughput: roughly 1 token per 2 seconds)
  • Supports Llama 3, Gemma 3, GPT-OSS-20B, Qwen3-Next, and multimodal models (Gemma 3 vision, Voxtral audio)
  • AutoInference class runs any Llama 3 / Gemma 3 model and supports PEFT adapters
  • Optional kvikio for faster NVIDIA SSD→GPU transfers; works on AMD and Apple Silicon too
  • No mandatory compiled extensions—flash-attn and kvikio are optional
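The 160 GB figure is consistent with simple arithmetic: parameter count times bytes per parameter. The sketch below assumes 80 billion parameters (read off the model name) and decimal gigabytes; the 4-bit line is included only as a comparison point for what quantization would save.

```python
def weight_footprint_gb(num_params, bytes_per_param):
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 80e9  # assumption: "80B" in the model name means ~80 billion parameters

fp16_gb = weight_footprint_gb(NUM_PARAMS, 2)    # fp16/bf16: 2 bytes per parameter
int4_gb = weight_footprint_gb(NUM_PARAMS, 0.5)  # 4-bit: half a byte per parameter

print(fp16_gb)  # 160.0
print(int4_gb)  # 40.0
```

Against the reported ~7.5 GB of VRAM, only a small fraction of the fp16 weights can be resident on the GPU at once, which is what the SSD streaming makes up for.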

Caveats

  • Throughput is modest; this is for offline batch work, not chatty interactive use
  • Requires significant SSD space and presumably fast storage to avoid bottlenecking
  • Installation needs pip's --no-build-isolation flag, so build dependencies must already be present in the environment; this hints at a non-trivial build step
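To see why this suits batch jobs rather than chat, a rough time budget helps. The sketch below uses the roughly 2 seconds per token quoted above for qwen3-next-80B; the workload (20 documents, 500 generated tokens each) is a made-up example, and the estimate ignores prompt processing time, which would add to it.

```python
def generation_seconds(num_documents, tokens_per_document, seconds_per_token=2.0):
    """Decode-only time estimate; prompt processing is not included."""
    return num_documents * tokens_per_document * seconds_per_token


# Hypothetical workload: 20 documents, 500 output tokens each.
total_seconds = generation_seconds(20, 500)
total_hours = total_seconds / 3600

print(total_seconds)          # 20000.0
print(round(total_hours, 2))  # 5.56
```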

Verdict

Worth a look if you need to process long documents locally and would rather trade speed for precision than quantize your model. If you need real-time responses or lack a roomy NVMe drive, this is not your tool.
