A shopping assistant for GPU-poor LLM enthusiasts
whichllm ranks local models by real benchmark scores, not parameter count, and tells you which ones actually fit your hardware.

What it does
whichllm auto-detects your GPU, CPU, and RAM, then queries live HuggingFace data to rank the models that fit your system. It scores them on merged benchmarks (LiveBench, Artificial Analysis, Aider, Arena Elo, etc.), adjusted for evidence confidence, recency, quantization penalty, and estimated inference speed. One command: whichllm. Add --json for scripts, --gpu "RTX 4090" to window-shop hardware you don’t own, or use run to download a model and chat immediately.
The interesting bit
The scoring is deliberately paranoid. Benchmark inheritance is rejected when a fork’s parameters diverge 2× from its family base, stopping small repackagers from borrowing a larger model’s score. Scores are tagged direct / variant / base / interpolated / self-reported and discounted accordingly. The README’s example is telling: a 32B model fits an RTX 4090 fine, but whichllm ranks a 27B #1 because it scores higher on actual benchmarks and is a newer generation. Size heuristics would get this wrong.
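To make the two guards above concrete, here is a minimal sketch of how an inheritance check and an evidence discount could fit together. The tag names come from the text; the discount weights, the function names, and the exact treatment of the 2× boundary are assumptions for illustration, not whichllm's actual values or code.

```python
# Hypothetical discount per evidence tag. The source names the tags but not
# the weights, so these numbers are placeholders.
EVIDENCE_WEIGHT = {
    "direct": 1.0,
    "variant": 0.9,
    "base": 0.8,
    "interpolated": 0.6,
    "self-reported": 0.5,
}


def can_inherit(fork_params_b: float, base_params_b: float,
                max_ratio: float = 2.0) -> bool:
    """Allow a fork to borrow its family's score only if sizes are comparable.

    Sizes are in billions of parameters. A ratio of max_ratio or more in
    either direction is treated as divergence (assumed boundary handling).
    """
    if fork_params_b <= 0 or base_params_b <= 0:
        return False
    ratio = max(fork_params_b, base_params_b) / min(fork_params_b, base_params_b)
    return ratio < max_ratio


def discounted_score(raw_score: float, evidence: str) -> float:
    """Scale a benchmark score by how trustworthy its provenance is."""
    return raw_score * EVIDENCE_WEIGHT[evidence]


# A 7B repackage cannot borrow a 70B model's score; a close sibling can.
print(can_inherit(7, 70))
print(can_inherit(27, 32))
print(discounted_score(80.0, "self-reported"))
```

Under these placeholder weights, a self-reported score has to beat a directly measured one by a wide margin to outrank it, which is the behavior the text describes.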
Key highlights
- Architecture-aware VRAM estimation: weights + GQA KV cache + activations + ~500MB framework overhead, with partial-offload and unified-memory modeling
- Speed estimates derived from memory bandwidth, quantization efficiency, per-backend factors, and MoE active-vs-total parameter splits
- whichllm plan "llama 3 70b" does reverse lookup: what GPU do I need for this model?
- whichllm upgrade compares your current machine against candidate GPUs before you buy
- whichllm snippet emits copy-paste Python using llama-cpp-python or transformers
- Live data with frozen fallbacks; 6h model cache, 24h benchmark cache
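The VRAM breakdown in the first highlight (weights + GQA KV cache + activations + ~500MB overhead) can be sketched as simple arithmetic. This is an illustration under stated assumptions, not the tool's implementation: the ModelShape class, the function names, the 5% activation fraction, and the 16-bit KV cache are all hypothetical choices, and the example shape is only roughly Llama-3-8B-like.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
FRAMEWORK_OVERHEAD_BYTES = 500 * 1024 ** 2  # the ~500MB figure from the text


@dataclass
class ModelShape:
    params: float     # total parameter count
    n_layers: int
    n_kv_heads: int   # with GQA this is smaller than the number of query heads
    head_dim: int


def kv_cache_bytes(m: ModelShape, context_len: int, kv_bytes: int = 2) -> int:
    """Keys and values: 2 tensors per layer, one vector per KV head per token."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_len * kv_bytes


def estimate_vram_gib(m: ModelShape, bits_per_weight: float, context_len: int,
                      activation_fraction: float = 0.05) -> float:
    """Sum the four terms named in the text and return GiB."""
    weights = m.params * bits_per_weight / 8
    activations = activation_fraction * weights  # crude assumed proxy
    total = (weights + kv_cache_bytes(m, context_len)
             + activations + FRAMEWORK_OVERHEAD_BYTES)
    return total / GIB


# Assumed inputs: an 8B-class model with 8 KV heads, ~4.5-bit quantization.
shape = ModelShape(params=8.03e9, n_layers=32, n_kv_heads=8, head_dim=128)
print(kv_cache_bytes(shape, context_len=8192) / GIB)  # 1.0 GiB for this shape
print(round(estimate_vram_gib(shape, bits_per_weight=4.5, context_len=8192), 2))
```

The KV term is where GQA matters: with 32 KV heads instead of 8, the same context would need four times the cache, which is why an architecture-aware estimate can differ noticeably from a parameters-only rule of thumb.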
Caveats
- Speed figures are planning estimates from bandwidth models, not live benchmarks; JSON output carries speed_confidence and speed_range_tok_per_sec to signal uncertainty
- Ollama integration requires manual mapping: HuggingFace repo IDs don’t always match Ollama model names
- The README notes benchmark snapshot dates are printed under rankings, but it’s unclear how aggressively stale data is demoted in practice
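The first caveat is easier to weigh with a bandwidth model in hand. Token generation is typically memory-bound: each new token requires reading the active weights once, so throughput is roughly bandwidth divided by bytes read per token. The sketch below shows that reasoning with a central estimate and a range. The efficiency factor, the ±30% spread, and the function name are assumptions; this is not how whichllm computes its speed_range_tok_per_sec field.

```python
def estimate_tok_per_sec(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float, efficiency: float = 0.6,
                         spread: float = 0.3) -> tuple[float, tuple[float, float]]:
    """Return (central, (low, high)) tokens/sec from a memory-bandwidth model.

    active_params_b is in billions; for an MoE model pass the active
    parameters per token, not the total. efficiency and spread are
    assumed fudge factors, not measured values.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    central = efficiency * bandwidth_gb_s / gb_read_per_token
    return central, (central * (1 - spread), central * (1 + spread))


# Assumed inputs: ~1000 GB/s of bandwidth, 8B dense model at 4 bits per weight.
central, (low, high) = estimate_tok_per_sec(1000, 8, 4)
print(f"{central:.0f} tok/s (range {low:.0f}-{high:.0f})")
```

The width of the range is the point: a figure like this is good enough to rule hardware in or out, but not to compare two backends that differ by 10%.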
Verdict
Useful if you’re tired of downloading 70B models that OOM or underperform. Overkill if you already know exactly which quantized GGUF runs on your single GPU. The --gpu simulation mode is genuinely handy for pre-purchase anxiety.