RTX 3090 owners finally get a cookbook for modern LLMs
Working Docker configs, measured TPS numbers, and honest gotchas for running Qwen3.6, Gemma 4, and friends on consumer GPUs—no cloud required.

What it does
This repo is a curated collection of working Docker Compose configs, patches, and benchmark scripts for serving modern LLMs locally on one or two RTX 3090s (with 4090/5090 support contributed and measured). Pick a model, run an interactive wizard, and get an OpenAI-compatible API on localhost:8020. Currently ships validated configs for Qwen3.6-27B, Qwen3.6-35B-A3B (MoE), Gemma 4 31B, and Gemma 4 26B-A4B (MoE).
The interesting bit
The project doesn’t pretend one engine rules them all. It offers two genuinely different paths: vLLM dual-card for raw throughput (up to 127 TPS on Qwen3.6-27B with speculative decoding) and llama.cpp single-card for stability (200K context on one 3090 without prefill OOM “cliffs”). The docs openly track which features are blocked where—vLLM single-card long context is broken on 24 GB, SGLang is blocked on Ampere, and Gemma 4 needs a community llama.cpp fork for head_dim=512.
Key highlights
- Multi-engine: vLLM, llama.cpp, and ik_llama (specialized GGUF quants) with auto-detected PCIe/NVLink dual-card setups
scripts/launch.shinteractive wizard resolves model + GPU count + VRAM budget to a working compose variantscripts/bench.shandscripts/quality-test.shfor reproducible TPS and behavioral benchmarks- “Universal pull” (v0.8.2) evaluates arbitrary HuggingFace safetensors repos against the repo’s KV math and gives an honest fit verdict
- Extensive docs on quantization tradeoffs, hardware gotchas, and a glossary for newcomers to local AI
Caveats
- Windows users need WSL2; native Windows only runs upstream llama.cpp, none of the repo’s tooling
- Single-card vLLM long context (>~50K prefill) is openly broken on 24 GB; workarounds are dual-card or switching to llama.cpp
- Some configs rely on unofficial multi-arch Docker images (sm_89/120 unvalidated) or community forks
Verdict
Homelabbers and devs sitting on 3090s who want modern models without API bills—this saves you weeks of VRAM math and broken composes. Cloud-native teams or Mac users need not apply.