scrya-com/rotorquant
KV cache quantization method for LLMs using block-diagonal rotations to compress transformer memory during inference.

Velocity (7d): +14 ★/day · Trend: steady
RotorQuant applies block-diagonal rotation matrices to compress the key-value (KV) cache in transformer models, achieving 10.3x compression with better perplexity and throughput than TurboQuant. It cuts decode latency by 28% and speeds up prefill by 5.3x, using planar/isolated rotation strategies that avoid the O(d log d) overhead of a butterfly network. It supports drop-in integration with llama.cpp for deployment.
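To make the block-diagonal idea concrete, here is a minimal sketch of the general technique, not RotorQuant's actual kernels. It assumes 2x2 blocks (one planar rotation per pair of coordinates) and uses a plain 4-bit uniform quantizer as a stand-in; the function names, angles, and bit width are all hypothetical. Each block touches only its own two coordinates, so rotating a d-dimensional vector costs O(d) rather than the O(d log d) of a butterfly network.

```python
import math


def rotate_blocks(vec, angles, inverse=False):
    """Apply an independent 2x2 rotation to each consecutive pair of coordinates.

    This is a block-diagonal rotation with 2x2 blocks (an assumed block size).
    """
    if len(vec) != 2 * len(angles):
        raise ValueError("need exactly one angle per pair of coordinates")
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta  # the inverse of a rotation is a rotation by -theta
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(c * x - s * y)
        out.append(s * x + c * y)
    return out


def quantize(vec, bits=4):
    """Symmetric uniform quantization with one scale per vector (a stand-in quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in vec)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


def compress(vec, angles, bits=4):
    """Rotate, then quantize."""
    return quantize(rotate_blocks(vec, angles), bits)


def decompress(codes, scale, angles):
    """Dequantize, then undo the rotation."""
    return rotate_blocks(dequantize(codes, scale), angles, inverse=True)


if __name__ == "__main__":
    key = [1.0, -2.0, 0.5, 3.0]   # toy "key" vector, d = 4
    angles = [0.3, 1.1]           # one hypothetical angle per 2x2 block
    codes, scale = compress(key, angles)
    print(codes, scale)
    print(decompress(codes, scale, angles))
```

Because every block is orthogonal, the rotation preserves the vector's norm and is undone exactly by rotating with the negated angles; the only loss in the round trip comes from the quantizer.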