Mega4alik/ollm

Run a 160 GB model on an 8 GB GPU, no quantization required

oLLM streams weights and KV cache from SSD to GPU layer-by-layer, keeping full fp16/bf16 precision while fitting massive contexts into consumer hardware.

★ 2.7k stars · Python · Inference · Serving · Language Models · ML Frameworks

What it does

oLLM is a Python inference library for running large-context LLMs on modest GPUs by aggressively offloading to SSD and CPU. It loads layer weights from disk directly into GPU memory one at a time, shunts the KV cache to SSD, and optionally parks some layers on CPU: no quantization, no 4-bit tricks, just orchestrated memory juggling on top of Hugging Face Transformers and PyTorch.
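
The streaming idea can be sketched in a few lines of plain Python. This is not oLLM's implementation or API: the "layers" are toy elementwise functions, pickle files stand in for the real weight shards, and `save_layers`, `streamed_forward` and `cpu_resident` are hypothetical names. It only illustrates the control flow of reading one layer from disk, applying it, and releasing it before the next, with some layers optionally kept in RAM.

```python
import os
import pickle
import tempfile


# Toy stand-in for a transformer block: y = x * scale + bias, elementwise.
# Real oLLM layers are PyTorch modules; this only mimics the control flow.
def apply_layer(params, x):
    return [v * params["scale"] + params["bias"] for v in x]


def save_layers(layers, directory):
    """Write each layer's parameters to its own file (one shard per layer)."""
    paths = []
    for i, params in enumerate(layers):
        path = os.path.join(directory, f"layer_{i:03d}.pkl")
        with open(path, "wb") as f:
            pickle.dump(params, f)
        paths.append(path)
    return paths


def streamed_forward(paths, x, cpu_resident=None):
    """Run the layers in order, reading each from disk only when it is needed.

    `cpu_resident` (hypothetical) maps a layer index to parameters already
    held in RAM, mimicking the "park some layers on CPU" option. Every other
    layer is loaded, applied, and dropped before the next one is read, so
    only one disk-loaded layer is held at a time.
    """
    cpu_resident = cpu_resident or {}
    disk_loads = 0
    for i, path in enumerate(paths):
        if i in cpu_resident:
            params = cpu_resident[i]
        else:
            with open(path, "rb") as f:
                params = pickle.load(f)
            disk_loads += 1
        x = apply_layer(params, x)
        del params  # drop our reference before moving to the next layer
    return x, disk_loads


layers = [
    {"scale": 2.0, "bias": 1.0},
    {"scale": 0.5, "bias": 0.0},
    {"scale": 1.0, "bias": -1.0},
]
with tempfile.TemporaryDirectory() as d:
    paths = save_layers(layers, d)
    # Keep layer 0 in memory; layers 1 and 2 are streamed from disk.
    out, loads = streamed_forward(paths, [1.0, 2.0], cpu_resident={0: layers[0]})
print(out, loads)  # prints [0.5, 1.5] 2
```

In the real library the tensors land in GPU memory and the KV cache is also written to and read back from SSD; the sketch leaves both out to keep the loop visible.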

The interesting bit

The trade-off is explicit and almost retro: you need plenty of fast SSD space (180 GB for qwen3-next-80B at 50k context, 69 GB for Llama-3.1-8B at 100k context), but in return you keep full precision and avoid the accuracy compromises of quantization. The library also chunks MLP layers and uses FlashAttention-2 with online softmax, so the full attention matrix is never materialized.
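
Both tricks rest on the same algebra: a softmax-weighted sum and an MLP's down projection can each be accumulated piece by piece, so the full score row or the full intermediate activation never has to exist at once. The sketch below shows this for a single query vector in plain Python. It illustrates the math only, not FlashAttention-2's fused GPU kernel or oLLM's code; the function names are hypothetical, and ReLU stands in for the gated activation that Llama-style MLPs actually use.

```python
import math


def naive_attention(q, keys, values):
    """Reference single-query attention: materializes every score first."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def online_softmax_attention(q, keys, values, chunk=2):
    """Same result, visiting keys/values one chunk at a time.

    Keeps a running max, a running softmax denominator, and an unnormalized
    weighted sum of values; when the max grows, earlier partial results are
    rescaled by exp(old_max - new_max). Only one chunk of scores exists at once.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) * scale
                  for k in keys[start:start + chunk]]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first chunk
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + chunk]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


def relu(z):
    return z if z > 0 else 0.0


def mlp(x, w_up, w_down):
    """Full MLP: builds the whole hidden vector, then projects it down."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]


def chunked_mlp(x, w_up, w_down, chunk=2):
    """Same output, but only `chunk` hidden units exist at a time.

    The down projection is a sum over hidden units, so each slice of the
    hidden dimension can be computed, projected, and accumulated separately.
    """
    out = [0.0] * len(w_down)
    for start in range(0, len(w_up), chunk):
        hidden = [relu(sum(w * xi for w, xi in zip(row, x)))
                  for row in w_up[start:start + chunk]]
        for r, row in enumerate(w_down):
            out[r] += sum(w * h for w, h in zip(row[start:start + chunk], hidden))
    return out


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(online_softmax_attention(q, keys, values, chunk=2))

x = [1.0, 2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [2.0, 1.0]]
w_down = [[1.0] * 5, [0.5, -1.0, 2.0, 0.0, 1.0]]
print(mlp(x, w_up, w_down), chunked_mlp(x, w_up, w_down))  # [10.0, 8.5] [10.0, 8.5]
```

The chunk sizes here are tiny for readability; the point is that peak memory scales with the chunk, not with the context length or the MLP's intermediate width.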

Key highlights

Fits a 160 GB qwen3-next-80B model into ~7.5 GB VRAM with 50k context (throughput: roughly 1 token per 2 seconds)
Supports Llama 3, Gemma 3, GPT-OSS-20B, Qwen3-Next, and multimodal models (Gemma 3 vision, Voxtral audio)
An AutoInference class runs any Llama 3 or Gemma 3 architecture model, with PEFT adapter support
Optional kvikio for faster SSD→GPU transfers on NVIDIA GPUs; the library itself also works on AMD and Apple Silicon
No mandatory compiled extensions: flash-attn and kvikio are optional
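
To put the headline numbers in perspective, here is the arithmetic using only the figures quoted in the list. It assumes the per-token latency stays constant for the whole generation and ignores the prefill pass over the 50k-token prompt, since no figure is given for it; `generation_minutes` is a hypothetical helper.

```python
# Figures quoted above for qwen3-next-80B at 50k context.
MODEL_GB = 160.0          # full-precision weights
VRAM_GB = 7.5             # approximate GPU memory used
SECONDS_PER_TOKEN = 2.0   # "roughly 1 token per 2 seconds"


def generation_minutes(new_tokens, seconds_per_token=SECONDS_PER_TOKEN):
    """Decode time at a fixed per-token latency (assumption: prefill is ignored)."""
    return new_tokens * seconds_per_token / 60.0


weights_to_vram_ratio = MODEL_GB / VRAM_GB
print(f"weights / VRAM: {weights_to_vram_ratio:.1f}x")         # weights / VRAM: 21.3x
print(f"500-token answer: {generation_minutes(500):.1f} min")  # 500-token answer: 16.7 min
```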

Caveats

Throughput is modest; this is for offline batch work, not chatty interactive use
Requires significant SSD space, and presumably fast storage so that disk reads do not become the bottleneck
Installation needs --no-build-isolation, suggesting some non-trivial native compilation
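
One way to see why storage speed matters is a rough I/O floor: if a dense model's weights are re-read from SSD for every generated token, latency cannot drop below the bytes read divided by the drive's bandwidth. This sketch is an illustration under those assumptions, not a model of oLLM's scheduler. The bandwidth figures are hypothetical, `fraction_in_memory` loosely models layers kept in RAM or VRAM, and the estimate does not carry over directly to sparse mixture-of-experts models such as Qwen3-Next, which touch only part of their weights per token.

```python
def min_seconds_per_token(weight_bytes, ssd_bytes_per_s, fraction_in_memory=0.0):
    """Crude I/O floor for one decode step of a dense model.

    Assumes every weight byte not already held in RAM/VRAM is re-read from
    SSD once per generated token and that compute fully overlaps the reads.
    KV-cache reads are ignored; they would only raise the floor.
    """
    return weight_bytes * (1.0 - fraction_in_memory) / ssd_bytes_per_s


WEIGHT_BYTES = 16e9  # ~8B parameters x 2 bytes (bf16), e.g. Llama-3.1-8B

# Hypothetical drive speeds, not measurements of any particular SSD.
for label, bandwidth in [("3.5 GB/s NVMe", 3.5e9), ("7 GB/s NVMe", 7e9)]:
    print(f"{label}: >= {min_seconds_per_token(WEIGHT_BYTES, bandwidth):.2f} s/token")
```

Halving the bytes that must come off disk (for example by parking layers in RAM) halves this floor, which is why the CPU-offload option and a fast NVMe drive both matter.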

Verdict

Worth a look if you need to process long documents locally and would rather trade speed for precision than quantize your model. If you need real-time responses or lack a roomy NVMe drive, this is not your tool.

Frequently asked

What is Mega4alik/ollm? oLLM streams weights and KV cache from SSD to GPU layer-by-layer, keeping full fp16/bf16 precision while fitting massive contexts into consumer hardware.
Is ollm open source? Yes, Mega4alik/ollm is open source, released under the MIT license.
What language is ollm written in? Mega4alik/ollm is primarily written in Python.
How popular is ollm? Mega4alik/ollm has 2.7k stars on GitHub.
Where can I find ollm? Mega4alik/ollm is on GitHub at https://github.com/Mega4alik/ollm.