ikawrakow/ik_llama.cpp
A high-performance fork of llama.cpp adding advanced quantization methods and optimized hybrid CPU/GPU inference for large language models.

Velocity (7d): +3.8 ★/day · Trend: steady
This repository extends llama.cpp with state-of-the-art quantization formats and performance improvements for running large language models. It adds row-interleaved quant packing, fused mixture-of-experts (MoE) operations, FlashMLA optimizations, and first-class Bitnet support. Its focus is efficient inference on hybrid CPU/GPU setups, with improved memory utilization and throughput.
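The general idea behind row-interleaved packing can be illustrated with a small sketch. Several rows of a weight matrix are stored block by block, side by side, so a matrix-vector kernel can load one activation block and reuse it for several rows in a single pass. This sketch is not the repository's actual memory layout. It assumes 4 rows per group and uses plain float lists in place of real quantized blocks. The helper names `interleave_rows` and `matvec_interleaved` are hypothetical.

```python
def interleave_rows(blocks, rows_per_group=4):
    """Pack blocks[r][j] (block j of row r) in row-interleaved order.

    Hypothetical helper: for each group of rows_per_group rows, block j of
    every row in the group is stored contiguously before moving to block j+1.
    """
    n_rows = len(blocks)
    if n_rows % rows_per_group != 0:
        raise ValueError("row count must be a multiple of rows_per_group")
    n_blocks = len(blocks[0])
    packed = []
    for g in range(0, n_rows, rows_per_group):
        for j in range(n_blocks):
            for r in range(g, g + rows_per_group):
                packed.append(blocks[r][j])
    return packed


def matvec_interleaved(packed, x_blocks, n_rows, rows_per_group=4):
    """Compute y = W x from interleaved blocks (float lists stand in for quants)."""
    n_blocks = len(x_blocks)
    y = [0.0] * n_rows
    idx = 0
    for g in range(0, n_rows, rows_per_group):
        acc = [0.0] * rows_per_group  # one accumulator per row in the group
        for j in range(n_blocks):
            xb = x_blocks[j]  # activation block loaded once, reused for every row in the group
            for k in range(rows_per_group):
                wb = packed[idx]
                idx += 1
                acc[k] += sum(w * v for w, v in zip(wb, xb))
        y[g:g + rows_per_group] = acc
    return y


if __name__ == "__main__":
    W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    blocks = [[row[0:2], row[2:4]] for row in W]  # block size 2, two blocks per row
    packed = interleave_rows(blocks)
    print(matvec_interleaved(packed, [[1, 1], [1, 1]], n_rows=4))
```

The payoff of this layout is memory locality. Each activation block is read once per group of rows instead of once per row, and the weights for that step sit next to each other in memory.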