← all repositories
noonghunna/club-3090

RTX 3090 owners finally get a cookbook for modern LLMs

Working Docker configs, measured TPS numbers, and honest gotchas for running Qwen3.6, Gemma 4, and friends on consumer GPUs—no cloud required.

club-3090
Velocity · 7d
+31
★ / day
Trend
steady
star history

What it does

This repo is a curated collection of working Docker Compose configs, patches, and benchmark scripts for serving modern LLMs locally on one or two RTX 3090s (with 4090/5090 support contributed and measured). Pick a model, run an interactive wizard, and get an OpenAI-compatible API on localhost:8020. Currently ships validated configs for Qwen3.6-27B, Qwen3.6-35B-A3B (MoE), Gemma 4 31B, and Gemma 4 26B-A4B (MoE).

The interesting bit

The project doesn’t pretend one engine rules them all. It offers two genuinely different paths: vLLM dual-card for raw throughput (up to 127 TPS on Qwen3.6-27B with speculative decoding) and llama.cpp single-card for stability (200K context on one 3090 without prefill OOM “cliffs”). The docs openly track which features are blocked where—vLLM single-card long context is broken on 24 GB, SGLang is blocked on Ampere, and Gemma 4 needs a community llama.cpp fork for head_dim=512.

Key highlights

  • Multi-engine: vLLM, llama.cpp, and ik_llama (specialized GGUF quants) with auto-detected PCIe/NVLink dual-card setups
  • scripts/launch.sh interactive wizard resolves model + GPU count + VRAM budget to a working compose variant
  • scripts/bench.sh and scripts/quality-test.sh for reproducible TPS and behavioral benchmarks
  • “Universal pull” (v0.8.2) evaluates arbitrary HuggingFace safetensors repos against the repo’s KV math and gives an honest fit verdict
  • Extensive docs on quantization tradeoffs, hardware gotchas, and a glossary for newcomers to local AI

Caveats

  • Windows users need WSL2; native Windows only runs upstream llama.cpp, none of the repo’s tooling
  • Single-card vLLM long context (>~50K prefill) is openly broken on 24 GB; workarounds are dual-card or switching to llama.cpp
  • Some configs rely on unofficial multi-arch Docker images (sm_89/120 unvalidated) or community forks

Verdict

Homelabbers and devs sitting on 3090s who want modern models without API bills—this saves you weeks of VRAM math and broken composes. Cloud-native teams or Mac users need not apply.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.