Is llama-swap open source?

Yes — mostlygeek/llama-swap is open source, released under the MIT license.

What language is llama-swap written in?

mostlygeek/llama-swap is primarily written in Go.

How popular is llama-swap?

mostlygeek/llama-swap has 5.1k stars on GitHub, and its star growth is accelerating: about +18 stars per day over the last 7 days.

Where can I find llama-swap?

mostlygeek/llama-swap is on GitHub at https://github.com/mostlygeek/llama-swap.

mostlygeek/llama-swap

A traffic cop for your GPU-starved LLM servers

Hot-swap between local AI models without keeping them all resident in VRAM.

★ 5.1k stars · Go · Inference · Serving

Velocity (7d): +18 ★ / day · Trend: accelerating

What it does

llama-swap sits between your client and any OpenAI- or Anthropic-compatible local inference server—llama.cpp, vLLM, tabbyAPI, even stable-diffusion.cpp. It reads the model field from each request, starts the right upstream server if it isn’t running, and tears down the old one. One model in memory at a time by default; a matrix DSL lets you run concurrent groups when you have the VRAM to spare.
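The routing described above can be sketched as a toy model. This is not llama-swap's actual code: the class name `SwapRouter`, the model IDs, and the launch commands are hypothetical, and stopping the old server before starting the new one is an assumption. It only illustrates reading the model field and swapping when the requested model differs from the one currently running.

```python
import json


class SwapRouter:
    """Toy model of swap-on-demand routing; not llama-swap's implementation."""

    def __init__(self, commands):
        # commands maps a model ID to a (hypothetical) launch command.
        self.commands = commands
        self.running = None  # ID of the model currently "loaded"
        self.events = []     # record of start/stop actions, for inspection

    def route(self, request_body: bytes) -> str:
        # Pick the upstream from the request's "model" field.
        model = json.loads(request_body)["model"]
        if model not in self.commands:
            raise KeyError(f"unknown model: {model}")
        if model != self.running:
            # Assumed ordering: tear down the old server, then start the new one.
            if self.running is not None:
                self.events.append(("stop", self.running))
            self.events.append(("start", model))
            self.running = model
        return model


router = SwapRouter({
    "llama-8b": "llama-server ...",  # placeholder commands
    "qwen-7b": "llama-server ...",
})
router.route(b'{"model": "llama-8b", "messages": []}')
router.route(b'{"model": "llama-8b", "messages": []}')  # already running: no swap
router.route(b'{"model": "qwen-7b", "messages": []}')   # different model: swap
print(router.events)
```

The second request produces no events because the requested model is already loaded, which is the one-model-in-memory default the text describes.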

The interesting bit

The project treats inference servers as disposable processes rather than pets. A minimal config is just three YAML lines that name a model ID and give the shell command to launch it on an auto-assigned ${PORT}. Everything else (TTL-based unloading, request parameter rewriting, Docker/Podman lifecycle hooks, API key gating) is optional sugar on top. The built-in web UI streams logs, shows token metrics, and lets you manually load or evict models without touching a terminal.
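As a rough illustration of that minimal config, the sketch below mirrors its shape as a Python dict and fills in the `${PORT}` placeholder. The key names `models` and `cmd`, the model ID `llama-8b`, the model path, and port 5800 are assumptions for illustration. The substitution uses Python's `string.Template`, which is a stand-in and not how llama-swap itself expands the macro.

```python
from string import Template

# Shape of the minimal YAML the text describes (key names are assumptions):
#
#   models:
#     "llama-8b":
#       cmd: llama-server --port ${PORT} --model /path/to/model.gguf
#
CONFIG = {
    "models": {
        "llama-8b": {
            "cmd": "llama-server --port ${PORT} --model /path/to/model.gguf",
        },
    },
}


def render_cmd(model_id: str, port: int) -> str:
    """Fill the ${PORT} placeholder in a model's launch command."""
    cmd = CONFIG["models"][model_id]["cmd"]
    return Template(cmd).substitute(PORT=port)


# 5800 is an arbitrary example port, standing in for the auto-assigned one.
rendered = render_cmd("llama-8b", 5800)
print(rendered)
```

Because the port is injected at launch time, the config never has to hard-code which port each upstream listens on.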

Key highlights

Zero runtime dependencies; single Go binary plus one config file
Supports OpenAI chat/completions/embeddings/images/audio, Anthropic messages, llama.cpp reranking and infill, plus stable-diffusion.cpp txt2img/img2img
Prometheus /metrics endpoint and per-model log streaming over HTTP
Preload models on startup, alias names like “gpt-4o-mini”, and filter/rewrite request parameters before they hit the upstream
Docker images bundle llama-server, whisper.cpp, and stable-diffusion.cpp for turnkey CUDA or Vulkan deployment
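The aliasing and parameter-filtering highlight can be pictured with a small sketch. The tables `ALIASES` and `STRIP_PARAMS` and the function `prepare` are hypothetical names, and llama-swap's real config keys and forwarding behavior may differ. In particular, whether the upstream sees the alias or the real model ID is an assumption here.

```python
# Hypothetical tables; llama-swap's real config keys may differ.
ALIASES = {"gpt-4o-mini": "llama-8b"}
STRIP_PARAMS = {"temperature", "top_p"}


def prepare(request: dict) -> dict:
    """Resolve a model alias and drop filtered parameters before forwarding."""
    # Drop any parameter the operator has chosen to filter out.
    out = {k: v for k, v in request.items() if k not in STRIP_PARAMS}
    # Map the client-facing alias to the locally configured model ID.
    out["model"] = ALIASES.get(request["model"], request["model"])
    return out


forwarded = prepare({"model": "gpt-4o-mini", "temperature": 0.2, "messages": []})
print(forwarded)
```

A client written against a hosted model name keeps working unchanged, while the operator controls which local model answers and which sampling parameters reach it.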

Caveats

Python-based servers such as vLLM should be containerized; the README notes they need clean SIGTERM handling that raw processes may not provide
Streaming behind nginx requires disabling proxy buffering; the README includes config snippets but it’s an easy footgun
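To make the SIGTERM caveat concrete, here is a minimal sketch of what a supervisor typically does when stopping a child server: send SIGTERM, wait, and fall back to a hard kill. This is a generic pattern using Python's `subprocess` module, not llama-swap's implementation. The helper name `stop_gracefully` and the 5-second timeout are assumptions, and the sleeping child stands in for a real upstream server.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX
        return proc.wait()


# Stand-in for an upstream server: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_gracefully(child)
print(exit_code)
```

A server that spawns its own worker processes may exit on SIGTERM while leaving workers behind still holding GPU memory. Stopping a container instead cleans up the whole process tree, which is one plausible reading of why the README recommends containers for such servers.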

Verdict

Ideal if you’re self-hosting multiple models on limited VRAM and want a single API endpoint that behaves like OpenAI. Skip it if you run one model 24/7 or already have a Kubernetes setup managing pod lifecycles.
