ggml-org/llama.cpp

Run LLMs on hardware that shouldn't run LLMs

A dependency-free C/C++ inference engine that squeezes large language models onto laptops, phones, and browsers through aggressive quantization and hand-rolled kernels.

★115.4k stars C++ Inference · Serving Language Models

View on GitHub ↗ Homepage ↗

Velocity · 7d

+97

★ / day

Trend

→steady

star history

What it does

llama.cpp runs LLM inference in plain C/C++ with zero dependencies. It targets everything from Apple Silicon to RISC-V, x86, and NVIDIA/AMD GPUs, offering 1.5-bit through 8-bit quantization to shrink models until they fit your available RAM or VRAM. A built-in OpenAI-compatible server (llama-server) and CLI tool (llama-cli) let you pull models directly from Hugging Face and start generating.

The interesting bit

The project treats Apple Silicon as a first-class citizen—unusual for open-source ML infrastructure—while also pioneering CPU+GPU hybrid inference so models larger than your VRAM don’t simply crash. It doubles as the testbed for ggml, the underlying tensor library, meaning new backends (WebGPU, Vulkan, SYCL) and quantization schemes often land here first.

Key highlights

Supports 60+ model families including LLaMA, Mistral, Mixtral, DeepSeek, Qwen, Gemma, and multimodal LLaVA variants
Native backends: Metal, CUDA, HIP, Vulkan, SYCL, plus browser-based WebGPU via WASM
Aggressive quantization: 1.5-bit to 8-bit integer formats, plus native MXFP4 support for NVIDIA’s gpt-oss collaboration
Ecosystem bindings span Python, Rust, Go, Node.js, C#, Ruby, Scala, Clojure, and browser WASM
VS Code and Vim/Neovim plugins for fill-in-the-middle code completion

Caveats

API churn is real: dedicated changelog issues track breaking changes for both libllama and llama-server
Packaging remains a work-in-progress; the maintainers are actively soliciting feedback on better downstream distribution
Model support is broad but implementation-driven—check the checklist before assuming your fine-tune works out of the box

Verdict

Essential if you need local inference on consumer hardware or want to ship LLMs in resource-constrained environments. Skip it if you’re already happy with cloud APIs and don’t care about quantization trade-offs or self-hosting complexity.