← all repositories
ggml-org/llama.cpp

Run LLMs on hardware that shouldn't run LLMs

A dependency-free C/C++ inference engine that squeezes large language models onto laptops, phones, and browsers through aggressive quantization and hand-rolled kernels.

llama.cpp
Velocity · 7d
+97
★ / day
Trend
steady
star history

What it does

llama.cpp runs LLM inference in plain C/C++ with zero dependencies. It targets everything from Apple Silicon to RISC-V, x86, and NVIDIA/AMD GPUs, offering 1.5-bit through 8-bit quantization to shrink models until they fit your available RAM or VRAM. A built-in OpenAI-compatible server (llama-server) and CLI tool (llama-cli) let you pull models directly from Hugging Face and start generating.

The interesting bit

The project treats Apple Silicon as a first-class citizen—unusual for open-source ML infrastructure—while also pioneering CPU+GPU hybrid inference so models larger than your VRAM don’t simply crash. It doubles as the testbed for ggml, the underlying tensor library, meaning new backends (WebGPU, Vulkan, SYCL) and quantization schemes often land here first.

Key highlights

  • Supports 60+ model families including LLaMA, Mistral, Mixtral, DeepSeek, Qwen, Gemma, and multimodal LLaVA variants
  • Native backends: Metal, CUDA, HIP, Vulkan, SYCL, plus browser-based WebGPU via WASM
  • Aggressive quantization: 1.5-bit to 8-bit integer formats, plus native MXFP4 support for NVIDIA’s gpt-oss collaboration
  • Ecosystem bindings span Python, Rust, Go, Node.js, C#, Ruby, Scala, Clojure, and browser WASM
  • VS Code and Vim/Neovim plugins for fill-in-the-middle code completion

Caveats

  • API churn is real: dedicated changelog issues track breaking changes for both libllama and llama-server
  • Packaging remains a work-in-progress; the maintainers are actively soliciting feedback on better downstream distribution
  • Model support is broad but implementation-driven—check the checklist before assuming your fine-tune works out of the box

Verdict

Essential if you need local inference on consumer hardware or want to ship LLMs in resource-constrained environments. Skip it if you’re already happy with cloud APIs and don’t care about quantization trade-offs or self-hosting complexity.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.