Run LLMs on hardware that shouldn't run LLMs
A dependency-free C/C++ inference engine that squeezes large language models onto laptops, phones, and browsers through aggressive quantization and hand-rolled kernels.

What it does
llama.cpp runs LLM inference in plain C/C++ with zero dependencies. It targets everything from Apple Silicon to RISC-V, x86, and NVIDIA/AMD GPUs, offering 1.5-bit through 8-bit quantization to shrink models until they fit your available RAM or VRAM. A built-in OpenAI-compatible server (llama-server) and CLI tool (llama-cli) let you pull models directly from Hugging Face and start generating.
The interesting bit
The project treats Apple Silicon as a first-class citizen—unusual for open-source ML infrastructure—while also pioneering CPU+GPU hybrid inference so models larger than your VRAM don’t simply crash. It doubles as the testbed for ggml, the underlying tensor library, meaning new backends (WebGPU, Vulkan, SYCL) and quantization schemes often land here first.
Key highlights
- Supports 60+ model families including LLaMA, Mistral, Mixtral, DeepSeek, Qwen, Gemma, and multimodal LLaVA variants
- Native backends: Metal, CUDA, HIP, Vulkan, SYCL, plus browser-based WebGPU via WASM
- Aggressive quantization: 1.5-bit to 8-bit integer formats, plus native MXFP4 support for NVIDIA’s gpt-oss collaboration
- Ecosystem bindings span Python, Rust, Go, Node.js, C#, Ruby, Scala, Clojure, and browser WASM
- VS Code and Vim/Neovim plugins for fill-in-the-middle code completion
Caveats
- API churn is real: dedicated changelog issues track breaking changes for both
libllamaandllama-server - Packaging remains a work-in-progress; the maintainers are actively soliciting feedback on better downstream distribution
- Model support is broad but implementation-driven—check the checklist before assuming your fine-tune works out of the box
Verdict
Essential if you need local inference on consumer hardware or want to ship LLMs in resource-constrained environments. Skip it if you’re already happy with cloud APIs and don’t care about quantization trade-offs or self-hosting complexity.