
punica-ai/punica

A high-performance LLM serving system that supports multiple LoRA finetuned models simultaneously by sharing the base model and computing LoRA addons via a custom CUDA kernel.

Velocity (7d): +1.1 ★ / day · Trend: steady

Punica serves multiple LoRA-finetuned LLMs on top of a single shared copy of the base model, which reduces memory and compute overhead. It exploits the batching effect of the shared pretrained model and computes the LoRA addons efficiently through a specialized CUDA kernel called Segmented Gather Matrix-Vector multiplication (SGMV). Even when many different LoRA adapters are served concurrently, latency stays close to that of serving a single one.
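To make the idea concrete, here is a minimal pure-Python sketch of what a segmented gather matrix-vector operation computes. It is a reference for the semantics only, not Punica's kernel or API: the function names (`sgmv_reference`, `batched_lora_forward`), the segment layout (requests sorted so that rows `seg[i]:seg[i+1]` share adapter `i`), and the two-step shrink/expand structure are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain row-major matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def sgmv_reference(y: Matrix, x: Matrix, weights: List[Matrix], seg: List[int]) -> Matrix:
    """Hypothetical reference semantics: for each segment i,
    y[seg[i]:seg[i+1]] += x[seg[i]:seg[i+1]] @ weights[i]."""
    for i, w in enumerate(weights):
        lo, hi = seg[i], seg[i + 1]
        out = matmul(x[lo:hi], w)
        for r, out_row in zip(range(lo, hi), out):
            y[r] = [a + b for a, b in zip(y[r], out_row)]
    return y


def batched_lora_forward(x: Matrix, w_base: Matrix,
                         lora_a: List[Matrix], lora_b: List[Matrix],
                         seg: List[int]) -> Matrix:
    """One shared base matmul for the whole batch, plus per-adapter LoRA addons."""
    y = matmul(x, w_base)                  # batched over all requests
    rank = len(lora_a[0][0])               # assumes every adapter has the same rank
    v = [[0.0] * rank for _ in x]
    sgmv_reference(v, x, lora_a, seg)      # shrink: h_in -> rank
    sgmv_reference(y, v, lora_b, seg)      # expand: rank -> h_out, added to y
    return y


# Three requests: rows 0-1 use adapter 0, row 2 uses adapter 1.
x = [[1, 2], [3, 4], [5, 6]]
w_base = [[1, 0], [0, 1]]                  # identity, to keep the arithmetic visible
lora_a = [[[1], [0]], [[0], [1]]]          # each A is h_in x rank (2 x 1)
lora_b = [[[1, 1]], [[2, 0]]]              # each B is rank x h_out (1 x 2)
result = batched_lora_forward(x, w_base, lora_a, lora_b, seg=[0, 2, 3])
print(result)  # [[2.0, 3.0], [6.0, 7.0], [17.0, 6.0]]
```

The base projection runs once over the full batch regardless of which adapter each request uses; only the small low-rank products are segmented by adapter. A real GPU kernel would fuse the per-segment loop rather than iterate over adapters in Python.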
