predibase/lorax
A framework for serving thousands of LoRA-adapted fine-tuned language models on a single GPU with dynamic adapter loading.

LoRAX enables efficient multi-tenant inference by dynamically loading fine-tuned LoRA adapters, so that thousands of fine-tuned models can be served from a single GPU. It builds on PyTorch and HuggingFace Transformers to manage adapter weights per request, and can load adapters from the HuggingFace Hub, Predibase, or the local filesystem. Adapters are loaded just-in-time without blocking concurrent requests, and multiple adapters can be merged per request to form ensembles. Inference is exposed through a REST API, a Python client, and an OpenAI-compatible interface.
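
To make the per-request adapter selection concrete, here is a minimal sketch of a REST call that names an adapter in the request body. It assumes a LoRAX server is reachable at a base URL you supply, that text generation is served at a `/generate` path, and that the adapter is chosen through `adapter_id` and `adapter_source` fields under `parameters`. These names reflect the REST interface as best understood here and should be checked against the LoRAX documentation; the helper names `build_generate_payload` and `generate`, and the adapter name used in the usage note, are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_generate_payload(
    prompt: str,
    adapter_id: Optional[str] = None,
    adapter_source: Optional[str] = None,
    max_new_tokens: int = 64,
) -> dict:
    """Build a JSON body that optionally selects a LoRA adapter for one request.

    Field names ("inputs", "parameters", "adapter_id", "adapter_source") are
    assumptions about the REST schema; verify them against the LoRAX docs.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    # Omitting adapter_id is assumed to route the prompt to the base model.
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    if adapter_source is not None:
        parameters["adapter_source"] = adapter_source
    return {"inputs": prompt, "parameters": parameters}


def generate(base_url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST the payload to the assumed /generate endpoint and parse the JSON reply."""
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, a call might look like `generate("http://127.0.0.1:8080", build_generate_payload("Hello", adapter_id="my-org/my-adapter", adapter_source="hub"))`. Because the adapter is named in each request rather than at server start, two consecutive requests can target different adapters on the same deployment.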