Microsoft shrinks LLMs to 1.58 bits, runs 100B params on a single CPU
An inference engine that makes ternary-weight models practical on commodity hardware, with no accuracy loss relative to the ternary model itself.

What it does
bitnet.cpp is Microsoft’s inference framework for 1-bit (technically 1.58-bit ternary) LLMs. It is built atop llama.cpp and runs quantized models on CPU and GPU, using lookup-table kernels derived from the T-MAC project. The project ships with setup scripts, benchmark tools, and support for several Hugging Face models including its own 2.4B-parameter release.
The interesting bit
The 1.58-bit gimmick isn’t just compression theater: each weight takes one of three values, {-1, 0, +1}, which works out to log2(3) ≈ 1.58 bits of information per weight and lets the framework replace expensive multiplications with additions and table lookups. Microsoft claims this enables a 100B-parameter model to run on a single CPU at human reading speed (5–7 tok/s), which is either impressive or slightly dystopian depending on your patience.
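To make the "lookups instead of multiplications" idea concrete, here is a toy sketch in plain Python. It is not bitnet.cpp's kernel code: the function names and the group size of 2 are hypothetical, chosen only to keep the table small. The point is that partial sums for every ternary pattern are precomputed once per activation vector, then reused by every weight row.

```python
import math
from itertools import product

# One ternary weight carries log2(3) bits of information, about 1.58.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

GROUP = 2  # hypothetical group size; real kernels pick their own tiling


def dot_reference(weights, activations):
    """Ordinary multiply-accumulate dot product, used as ground truth."""
    return sum(w * a for w, a in zip(weights, activations))


def dot_add_sub(weights, activations):
    """Ternary dot product using only additions and subtractions."""
    total = 0
    for w, a in zip(weights, activations):
        if w == 1:
            total += a
        elif w == -1:
            total -= a
        # w == 0 contributes nothing
    return total


def build_table(act_group):
    """Precompute the partial sum for every ternary pattern over one activation group."""
    return {
        pattern: dot_add_sub(pattern, act_group)
        for pattern in product((-1, 0, 1), repeat=len(act_group))
    }


def matvec_lut(weight_rows, activations, group=GROUP):
    """Matrix-vector product where each row does only table lookups and adds."""
    starts = list(range(0, len(activations), group))
    # Tables depend only on the activations, so they are built once and shared.
    tables = [build_table(activations[s:s + group]) for s in starts]
    out = []
    for row in weight_rows:
        acc = 0
        for table, s in zip(tables, starts):
            acc += table[tuple(row[s:s + group])]
        out.append(acc)
    return out


W = [[1, 0, -1, 1], [-1, -1, 0, 1], [0, 0, 1, -1]]
x = [3, -2, 5, 7]
print(matvec_lut(W, x))  # [5, 6, -2]
```

With a group of 2 each table has 3^2 = 9 entries. The table-building cost is paid once per input vector, so the saving grows with the number of weight rows that reuse it.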
Key highlights
- CPU speedups of 1.37x–5.07x on ARM and 2.37x–6.17x on x86 versus baseline, per the project’s own benchmarks
- Energy reduction of 55–82% depending on architecture
- Additional 1.15x–2.1x speedup from recent parallel kernel optimizations with configurable tiling
- Supports the official Microsoft 2B model (2.4B parameters) plus community 1-bit variants (Llama3-8B, Falcon3 family, Falcon-E)
- GPU inference kernels added May 2025; NPU support listed as “coming next”
Caveats
- Kernel support is fragmented: no single kernel works across all models and architectures (check the compatibility table before assuming your setup runs)
- Requires clang≥18 and a somewhat involved build process; Windows users must build from a Developer Command Prompt (or PowerShell) for VS2022
- “100B model on one CPU” is technically true, but only at 5–7 tok/s: enough to read along interactively, too slow for batch or high-throughput work
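For a rough sense of what the 100B claim implies, the arithmetic below estimates weight storage and response latency. It is a back-of-envelope sketch, not a measurement: it counts weights only (no activations, KV cache, or packing overhead), the 2-bit packing row is an illustrative assumption rather than a statement about bitnet.cpp's formats, and the 500-token response length is a made-up example.

```python
import math


def weight_footprint_gb(n_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes); ignores activations, KV cache, packing overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def seconds_for(tokens, tok_per_s):
    """Wall-clock seconds to generate `tokens` at a steady decode rate."""
    return tokens / tok_per_s


N_PARAMS = 100e9  # the 100B-parameter model from the claim

fp16_gb = weight_footprint_gb(N_PARAMS, 16)              # 16-bit weights
ideal_gb = weight_footprint_gb(N_PARAMS, math.log2(3))   # information-theoretic floor
two_bit_gb = weight_footprint_gb(N_PARAMS, 2)            # assumed simple 2-bit packing

print(f"fp16:          {fp16_gb:.1f} GB")
print(f"ternary ideal: {ideal_gb:.1f} GB")
print(f"2-bit packed:  {two_bit_gb:.1f} GB")

# Hypothetical 500-token answer at the quoted 5-7 tok/s range.
slow = seconds_for(500, 5)
fast = seconds_for(500, 7)
print(f"500 tokens: {fast:.0f}-{slow:.0f} s")
```

Under these assumptions the weights shrink from about 200 GB to roughly 20–25 GB, which is what puts a single machine within reach, while a 500-token answer still takes over a minute.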
Verdict
Worth a look if you’re shipping edge devices or burning cloud budget on inference. Skip it if you need state-of-the-art quality from small models—the 1-bit constraint still trades accuracy for efficiency, and the supported model zoo is limited.