
scrya-com/rotorquant

KV cache quantization method for LLMs using block-diagonal rotations to compress transformer memory during inference.

1k stars · Python · Inference · Serving
Velocity (7d): +14 ★/day · Trend: steady

RotorQuant applies block-diagonal rotation matrices to compress the key-value (KV) cache in transformer models, achieving 10.3x compression with better perplexity and throughput than TurboQuant. It cuts decode latency by 28% and speeds up prefill by 5.3x through planar/isolated rotation strategies that avoid the O(d log d) overhead of a butterfly network. It supports drop-in integration with llama.cpp for deployment.
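To make the block-diagonal idea concrete, here is a minimal NumPy sketch of a planar (2x2 block) rotation followed by simple uniform quantization. It is an illustration under stated assumptions, not RotorQuant's actual code: the function names (`planar_rotate`, `quantize`, `dequantize`), the random angles, and the 4-bit symmetric quantizer are all hypothetical choices made for this example. The point it shows is that rotating each coordinate pair independently costs O(d) per vector and is exactly invertible, so no butterfly network is needed.

```python
import numpy as np

def planar_rotate(x, angles):
    """Rotate consecutive coordinate pairs of x by independent 2x2 rotations.

    Equivalent to multiplying by a block-diagonal matrix with d/2 blocks,
    but done in O(d) per vector without materializing the matrix.
    """
    d = x.shape[-1]
    pairs = x.reshape(*x.shape[:-1], d // 2, 2)
    c, s = np.cos(angles), np.sin(angles)
    a, b = pairs[..., 0], pairs[..., 1]
    out = np.stack([c * a - s * b, s * a + c * b], axis=-1)
    return out.reshape(x.shape)

def quantize(x, bits=4):
    """Symmetric per-vector uniform quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale

# Hypothetical toy data: 16 key vectors with head dimension 128.
rng = np.random.default_rng(0)
d = 128
keys = rng.standard_normal((16, d)).astype(np.float32)
angles = rng.uniform(0.0, 2.0 * np.pi, size=d // 2)

rotated = planar_rotate(keys, angles)          # rotate before quantizing
q, scale = quantize(rotated, bits=4)           # compress
restored = planar_rotate(dequantize(q, scale), -angles)  # undo the rotation

rel_err = np.linalg.norm(restored - keys) / np.linalg.norm(keys)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Negating the angles inverts each 2x2 block, so the only reconstruction error comes from the quantizer. A real implementation would choose the rotations and the quantization scheme far more carefully than this sketch does.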
