Is kvcached open source?

Yes. ovg-project/kvcached is open source and released under the Apache-2.0 license.

What language is kvcached written in?

ovg-project/kvcached is primarily written in Python.

How popular is kvcached?

ovg-project/kvcached has 1.1k stars on GitHub.

Where can I find kvcached?

ovg-project/kvcached is on GitHub at https://github.com/ovg-project/kvcached.

ovg-project/kvcached

Paging for attention: elastic GPU memory for LLMs

kvcached treats GPU memory like virtual RAM so LLM serving engines can share VRAM elastically instead of reserving it statically at boot.

★ 1.1k stars | Python | Inference · Serving | LLMOps · Eval

What it does

kvcached is a library that plugs into vLLM and SGLang to make KV cache allocation elastic. Instead of statically reserving GPU memory at startup, engines reserve virtual address space and back it with physical VRAM only when the cache is actively used. This lets multiple models, serverless workers, or training jobs share a GPU without the rigid memory walls common today.
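The reserve-then-back idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: `PhysicalPool` and `ElasticKVCache` are hypothetical names invented for this example, pages are plain counters rather than GPU memory, and none of it is kvcached's actual API.

```python
class PhysicalPool:
    """Shared pool of physical pages, standing in for one GPU's VRAM (toy model)."""

    def __init__(self, total_pages: int) -> None:
        self.total_pages = total_pages
        self.free_pages = total_pages

    def take(self) -> None:
        if self.free_pages == 0:
            raise MemoryError("no free physical pages")
        self.free_pages -= 1

    def give_back(self) -> None:
        self.free_pages += 1


class ElasticKVCache:
    """Hypothetical cache: reserves virtual pages up front, maps physical ones on use."""

    def __init__(self, name: str, pool: PhysicalPool, virtual_pages: int) -> None:
        self.name = name
        self.pool = pool
        self.virtual_pages = virtual_pages  # address space only; consumes no pool pages
        self.mapped: set[int] = set()       # virtual page indices currently backed

    def touch(self, page: int) -> None:
        """Back a virtual page with a physical page the first time it is used."""
        if not 0 <= page < self.virtual_pages:
            raise IndexError(f"page {page} is outside the reserved range")
        if page not in self.mapped:
            self.pool.take()
            self.mapped.add(page)

    def release(self, page: int) -> None:
        """Unmap a page and return its physical backing to the shared pool."""
        if page in self.mapped:
            self.mapped.remove(page)
            self.pool.give_back()

    @property
    def resident(self) -> int:
        return len(self.mapped)


# Two models each reserve the whole 100-page range: 200 virtual pages, 100 physical.
pool = PhysicalPool(total_pages=100)
model_a = ElasticKVCache("model-a", pool, virtual_pages=100)
model_b = ElasticKVCache("model-b", pool, virtual_pages=100)

for page in range(70):      # model-a is busy
    model_a.touch(page)
for page in range(20):      # model-b is mostly idle
    model_b.touch(page)
free_at_peak = pool.free_pages

for page in range(20, 70):  # model-a's load drops, so it releases pages
    model_a.release(page)
for page in range(20, 70):  # model-b grows into the freed memory
    model_b.touch(page)

print(model_a.resident, model_b.resident, pool.free_pages)  # 20 70 10
```

With static reservation, each model would need its peak footprint set aside for its whole lifetime, and two 70-page peaks would not fit in a 100-page pool. Here they fit because the peaks do not overlap in time.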

The interesting bit

The clever part is applying OS-style virtual memory paging to GPU tensors: logical KV cache addresses are decoupled from physical allocation, so idle models can effectively page out their footprint without being killed. A frontend router and sleep mode add a control plane that can park models and wake them on demand, which is what makes serverless and multi-tenant colocation practical.
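A toy control plane shows how parking and waking might interact with a fixed memory budget. Everything here is an assumption made for illustration: `Model`, `Router`, the page counts, and the least-recently-used eviction policy are hypothetical and do not describe kvcached's real router or its sleep-mode interface.

```python
class Model:
    """Hypothetical model handle; resident_pages is its footprint while awake."""

    def __init__(self, name: str, resident_pages: int) -> None:
        self.name = name
        self.resident_pages = resident_pages
        self.asleep = True  # sleeping models hold no physical pages in this toy


class Router:
    """Hypothetical front end: wakes the requested model, parking idle ones if needed."""

    def __init__(self, total_pages: int) -> None:
        self.total_pages = total_pages
        self.models: dict[str, Model] = {}
        self.last_used: dict[str, int] = {}
        self.clock = 0

    def register(self, model: Model) -> None:
        self.models[model.name] = model

    def used_pages(self) -> int:
        return sum(m.resident_pages for m in self.models.values() if not m.asleep)

    def awake(self) -> list[str]:
        return sorted(name for name, m in self.models.items() if not m.asleep)

    def route(self, name: str) -> None:
        """Serve a request for `name`, waking the model first if it is asleep."""
        self.clock += 1
        target = self.models[name]
        if target.asleep:
            # Assumed policy: park least recently used models until the target fits.
            victims = sorted(
                (m for m in self.models.values() if not m.asleep),
                key=lambda m: self.last_used[m.name],
            )
            for victim in victims:
                if self.used_pages() + target.resident_pages <= self.total_pages:
                    break
                victim.asleep = True
            if self.used_pages() + target.resident_pages > self.total_pages:
                raise MemoryError(f"{name} does not fit even with all others asleep")
            target.asleep = False
        self.last_used[name] = self.clock


router = Router(total_pages=100)
for model in (Model("a", 60), Model("b", 60), Model("c", 30)):
    router.register(model)

router.route("a")  # a wakes: 60 pages in use
router.route("c")  # c wakes alongside a: 90 pages in use
router.route("b")  # b needs 60 pages, so a (least recently used) is parked
print(router.awake())  # ['b', 'c']
```

The parked model stays registered rather than being torn down, so a later request can wake it again. That mirrors the park-and-wake behavior described above without claiming anything about how the real implementation restores state.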

Key highlights

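Engine support: Works as a plugin for SGLang and vLLM, handling multi-head attention (MHA), grouped-query attention (GQA), and multi-head latent attention (MLA), with DeepSeek-V3, GPT-OSS, Llama 3, and Qwen 2.5 named as example models.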
Elastic allocation: Reclaims and reallocates KV memory dynamically; idle models can be put to sleep via the built-in router.
Prefix caching: Supports automatic prefix caching for vLLM and RadixCache for SGLang without breaking elastic bounds.
Measured impact: Benchmarks on an A100-80G serving three Llama-3.1-8B models report 2–28× lower time to first token (TTFT) during intermittent peaks versus static reservation.
Production footprint: Red Hat’s Sardeenz builds on kvcached for dynamic multi-model serving on Kubernetes/OpenShift.

Caveats

Version coverage gaps: The authors explicitly warn that they have not tested every SGLang and vLLM release, so problems on versions outside the verified ranges are likely.
Platform packaging: GB200 Docker images are listed as not yet available, though standard images exist.

Verdict

Worth a look if you are trying to squeeze multiple models, serverless workers, or mixed workloads onto the same GPU. If you already dedicate one static model per card and max it out, the abstraction is likely overkill.
