punica-ai/punica
A high-performance LLM serving system that serves multiple LoRA finetuned models simultaneously by sharing the base model and computing the LoRA add-ons with a custom CUDA kernel.

Punica serves multiple LoRA finetuned LLMs on top of a single shared base model, which substantially reduces memory and compute overhead compared with serving each finetuned model separately. The pretrained base weights are applied to the whole batch at once, regardless of which adapter each request uses, so the usual batching benefit is preserved. The per-adapter LoRA add-ons are computed by a specialized CUDA kernel called Segmented Gather Matrix-Vector multiplication (SGMV). As a result, latency stays close to that of serving a single adapter even when a batch mixes requests for many different LoRA adapters.
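To make the idea concrete, here is a minimal pure-Python sketch of the computation that a segmented gather matrix-vector operation performs. It is a reference for the semantics only, not Punica's CUDA kernel or its API. The function names (`matvec`, `sgmv_reference`, `lora_batch_forward`), the segment layout (`seg[a]` to `seg[a + 1]` marks the rows that use adapter `a`), and the toy weights are all assumptions made for illustration.

```python
def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[k] * W[k][j] for k in range(len(x)))
            for j in range(len(W[0]))]


def sgmv_reference(y, xs, weights, seg):
    """Hypothetical reference: y[i] += xs[i] @ weights[a] for rows in segment a.

    Rows seg[a] .. seg[a + 1] - 1 of the batch all use adapter a, so each
    segment gathers one weight matrix and applies it to its own rows.
    """
    for a in range(len(seg) - 1):
        for i in range(seg[a], seg[a + 1]):
            delta = matvec(xs[i], weights[a])
            y[i] = [yi + d for yi, d in zip(y[i], delta)]
    return y


def lora_batch_forward(xs, W, A, B, seg):
    """Shared base projection plus a per-request LoRA add-on (x @ A[a] @ B[a])."""
    # Base model: the same W is applied to every request in the batch.
    y = [matvec(x, W) for x in xs]
    # Shrink step: project each input down to the LoRA rank r.
    r = len(A[0][0])
    v = sgmv_reference([[0.0] * r for _ in xs], xs, A, seg)
    # Expand step: project back up and accumulate into the base output.
    return sgmv_reference(y, v, B, seg)


# Toy data (assumed): hidden size 2, LoRA rank 1, two adapters.
W = [[1, 0], [0, 1]]                      # shared base weight (identity)
A = [[[1], [0]], [[0], [1]]]              # A[a] has shape 2 x 1
B = [[[1, 1]], [[2, 0]]]                  # B[a] has shape 1 x 2
xs = [[1, 2], [3, 4], [5, 6]]             # three requests in one batch
seg = [0, 2, 3]                           # rows 0-1 use adapter 0, row 2 uses adapter 1

out = lora_batch_forward(xs, W, A, B, seg)
```

In this sketch the base projection runs once over the whole batch, while the add-on is applied segment by segment; a real kernel would parallelize the segments on the GPU rather than loop over them.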