microsoft/BitNet

Microsoft shrinks LLMs to 1.58 bits, runs 100B params on a single CPU

An inference engine that makes ternary-weight models practical on commodity hardware without adding accuracy loss at inference time.

What it does

bitnet.cpp is Microsoft’s inference framework for 1-bit (technically 1.58-bit ternary) LLMs. It is built on llama.cpp and runs quantized models on CPU and GPU, using kernels based on the lookup-table approach from the T-MAC project. The project ships with setup scripts, benchmark tools, and support for several Hugging Face models, including its own 2.4B-parameter release.
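To put "1.58-bit" in perspective, here is a rough back-of-the-envelope sketch of where the number comes from and what it means for weight storage. The helper names are made up for illustration, and the estimate covers weights only: it ignores activations, the KV cache, embeddings, and any packing overhead. Real packed formats tend to round up (2 bits per weight is assumed here as an example), so treat the results as a lower bound rather than a measurement.

```python
import math


def ternary_bits_per_weight() -> float:
    """Information content of one weight drawn from {-1, 0, +1}."""
    return math.log2(3)


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    ideal = ternary_bits_per_weight()
    for label, params in (("2.4B", 2.4e9), ("100B", 100e9)):
        print(
            f"{label}: fp16 {weight_memory_gb(params, 16):.1f} GB, "
            f"2-bit packed {weight_memory_gb(params, 2):.1f} GB, "
            f"ideal ternary {weight_memory_gb(params, ideal):.1f} GB"
        )
```

Under these assumptions a 100B-parameter model drops from roughly 200 GB of fp16 weights to about 20–25 GB, which is what moves it into the range of a single machine's RAM.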

The interesting bit

The 1.58-bit trick isn’t just compression theater: weights take values in {-1, 0, +1}, so each one carries log2(3) ≈ 1.58 bits, and the framework can replace expensive multiplications with additions, subtractions, and table lookups. Microsoft claims this enables a 100B-parameter model to run on a single CPU at human reading speed (5–7 tok/s), which is either impressive or slightly dystopian depending on your patience.
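A minimal sketch of why ternary weights avoid multiplication. This is a simplified pure-Python illustration, not bitnet.cpp's actual kernels: the function names and the group size of 2 are assumptions chosen for readability. The first routine computes a matrix-vector product with only additions and subtractions. The second precomputes, for each small group of activations, the partial sum for every possible ternary pattern, so each group of weights then costs a single table lookup.

```python
from itertools import product


def signed_sum(pattern, values):
    """Sum values with signs from a ternary pattern, using no multiplication."""
    acc = 0.0
    for w, v in zip(pattern, values):
        if w == 1:
            acc += v
        elif w == -1:
            acc -= v
        # w == 0 contributes nothing
    return acc


def matvec_addsub(weights, x):
    """Ternary matrix-vector product via additions and subtractions only."""
    return [signed_sum(row, x) for row in weights]


def build_tables(x, group=2):
    """For each group of activations, tabulate all 3**group partial sums."""
    tables = []
    for start in range(0, len(x), group):
        chunk = x[start:start + group]
        tables.append(
            {p: signed_sum(p, chunk) for p in product((-1, 0, 1), repeat=len(chunk))}
        )
    return tables


def matvec_lut(weights, x, group=2):
    """Same product, but each weight group is resolved by one table lookup."""
    tables = build_tables(x, group)  # built once, reused for every row
    out = []
    for row in weights:
        acc = 0.0
        for table, start in zip(tables, range(0, len(row), group)):
            acc += table[tuple(row[start:start + group])]
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[1, 0, -1, 1], [0, 0, 1, -1], [-1, -1, -1, 0]]
    x = [0.5, 2.0, -1.0, 4.0]
    print(matvec_addsub(W, x))
    print(matvec_lut(W, x))
```

The tables depend only on the activations, so their cost is amortized over every row of the weight matrix. Larger groups mean fewer lookups per row but tables that grow as 3 to the power of the group size.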

Key highlights

  • CPU speedups of 1.37x–5.07x on ARM and 2.37x–6.17x on x86 versus baseline, per the project’s own benchmarks
  • Energy reduction of 55–82% depending on architecture
  • Additional 1.15x–2.1x speedup from recent parallel kernel optimizations with configurable tiling
  • Supports the official Microsoft model (BitNet-b1.58-2B-4T, 2.4B parameters) plus community 1-bit variants (Llama3-8B, Falcon3 family, Falcon-E)
  • GPU inference kernels added May 2025; NPU support listed as “coming next”

Caveats

  • Kernel support is fragmented: no single kernel works across all models and architectures (check the compatibility table before assuming your setup runs)
  • Requires clang ≥ 18 and a somewhat involved build process; Windows users must build from a Developer Command Prompt or PowerShell for VS2022
  • “100B model on one CPU” is technically true, but only at 5–7 tok/s: enough for a single interactive reader, not for high-throughput serving

Verdict

Worth a look if you’re shipping edge devices or burning cloud budget on inference. Skip it if you need state-of-the-art quality from small models—the 1-bit constraint still trades accuracy for efficiency, and the supported model zoo is limited.
