ikawrakow/ik_llama.cpp

A high-performance fork of llama.cpp adding advanced quantization methods and optimized hybrid CPU/GPU inference for large language models.

★ 2.7k stars · C++ · Inference · Serving · Language Models

Velocity (7d): +3.8 ★/day · Trend: → steady

This repository extends llama.cpp with state-of-the-art quantization formats and performance improvements for running large language models. It adds row-interleaved quant packing, fused mixture-of-experts (MoE) operations, FlashMLA optimizations, and first-class Bitnet support. The focus is efficient inference on hybrid CPU/GPU backends, with better memory utilization and higher throughput.
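
To give a rough idea of what row-interleaved packing means in general, the sketch below repacks a row-major matrix of quantized blocks so that the blocks at the same column position from several consecutive rows sit next to each other in memory. A matrix-vector kernel can then load one chunk of activations and apply it to several rows with contiguous reads. This is an illustration only and not the repository's actual layout or code: the `Block` type, the `interleave_rows` function, and the group size of 4 rows are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical block: 32 weights quantized to 8 bits sharing one scale.
// (Illustrative only; not a type taken from the repository.)
struct Block {
    float  scale;
    int8_t q[32];
};

// Repack a row-major matrix of blocks so that, within each group of
// `group` consecutive rows, the blocks at column position b are adjacent.
// Input index:  row * blocks_per_row + b
// Output index: group_start * blocks_per_row + b * group + row_in_group
std::vector<Block> interleave_rows(const std::vector<Block>& src,
                                   std::size_t n_rows,
                                   std::size_t blocks_per_row,
                                   std::size_t group = 4) {
    if (group == 0 || n_rows % group != 0 ||
        src.size() != n_rows * blocks_per_row) {
        throw std::invalid_argument("shape must be a whole number of row groups");
    }
    std::vector<Block> dst(src.size());
    for (std::size_t r0 = 0; r0 < n_rows; r0 += group) {
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            for (std::size_t r = 0; r < group; ++r) {
                dst[r0 * blocks_per_row + b * group + r] =
                    src[(r0 + r) * blocks_per_row + b];
            }
        }
    }
    return dst;
}
```

With 4 rows of 2 blocks each, the first four output blocks are block 0 of rows 0 to 3, followed by block 1 of rows 0 to 3. The repacking is a pure permutation, so it changes memory access patterns without changing the quantized values.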