Andyyyy64/whichllm

A shopping assistant for GPU-poor LLM enthusiasts

whichllm ranks local models by real benchmark scores, not parameter count, and tells you which ones actually fit your hardware.

Velocity (7d): +32 ★ / day · Trend: steady

What it does

whichllm auto-detects your GPU, CPU, and RAM, then queries live HuggingFace data to rank the models that fit your system. It scores them on merged benchmarks (LiveBench, Artificial Analysis, Aider, Arena ELO, and others), adjusted for evidence confidence, recency, quantization penalty, and estimated inference speed. The basic invocation is a single command: whichllm. Add --json for scripts or --gpu "RTX 4090" to window-shop hardware you don’t own, or use run to download a model and chat immediately.

The interesting bit

The scoring is deliberately paranoid. A fork cannot inherit benchmark scores when its parameter count diverges 2× from its family base, which stops small repackaged models from borrowing a larger model’s score. Each score is tagged direct / variant / base / interpolated / self-reported and discounted accordingly. The README’s example is telling: a 32B model fits an RTX 4090 fine, but whichllm ranks a 27B model #1 because it scores higher on actual benchmarks and belongs to a newer generation. A size-only heuristic would get this wrong.
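As a rough illustration of the two rules above, here is a minimal Python sketch. It is not whichllm's code: the Candidate class, the function names, the numeric discount weights, and the choice to treat a ratio of exactly 2× as rejected are all assumptions made for the example. Only the tag names and the 2× figure come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical discount per evidence tag. The tag names come from the
# description above; the numeric weights are invented for illustration.
EVIDENCE_DISCOUNT = {
    "direct": 1.00,
    "variant": 0.90,
    "base": 0.80,
    "interpolated": 0.65,
    "self-reported": 0.50,
}

# Assumption: these tags mean the score was borrowed from a family base.
INHERITED_TAGS = {"variant", "base"}

# Assumption: a size ratio of 2x or more, in either direction, blocks inheritance.
MAX_PARAM_RATIO = 2.0


@dataclass
class Candidate:
    name: str
    params_b: float              # parameter count, in billions
    family_base_params_b: float  # parameter count of the family base, in billions
    raw_score: float             # merged benchmark score on an arbitrary scale
    evidence: str                # one of the EVIDENCE_DISCOUNT keys


def can_inherit(params_b: float, base_params_b: float) -> bool:
    """True when the fork is close enough in size to borrow the base's score."""
    ratio = max(params_b, base_params_b) / min(params_b, base_params_b)
    return ratio < MAX_PARAM_RATIO


def effective_score(c: Candidate) -> Optional[float]:
    """Discounted score, or None when an inherited score must be rejected."""
    if c.evidence in INHERITED_TAGS and not can_inherit(
        c.params_b, c.family_base_params_b
    ):
        return None
    return c.raw_score * EVIDENCE_DISCOUNT[c.evidence]
```

Under these assumed weights, a 7B repackage of a 32B family gets no inherited score at all, while a 30B fine-tune of the same family keeps the base score at a discount.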

Key highlights

  • Architecture-aware VRAM estimation: weights + GQA KV cache + activations + ~500MB framework overhead, with partial-offload and unified-memory modeling
  • Speed estimates derived from memory bandwidth, quantization efficiency, per-backend factors, and MoE active-vs-total parameter splits
  • whichllm plan "llama 3 70b" does reverse lookup: what GPU do I need for this model?
  • whichllm upgrade compares your current machine against candidate GPUs before you buy
  • whichllm snippet emits copy-paste Python using llama-cpp-python or transformers
  • Live data with frozen fallbacks; 6h model cache, 24h benchmark cache
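
The VRAM breakdown in the first bullet can be sketched as a simple sum. This is a back-of-the-envelope version under stated assumptions, not the tool's actual estimator: the function names, the fixed activation allowance, and the FP16 KV cache default are hypothetical, and partial offload and unified memory are not modeled. The KV cache term uses the standard grouped-query attention size: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GB (params_b is in billions)."""
    return params_b * bits_per_weight / 8.0


def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,  # assumption: FP16 cache
) -> float:
    """KV cache size in GB. With GQA only the KV heads are cached."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


def estimate_vram_gb(
    params_b: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    activation_gb: float = 0.5,  # assumption: flat allowance for activations
    overhead_gb: float = 0.5,    # the ~500MB framework overhead mentioned above
) -> float:
    """Sum of weights, KV cache, activations, and framework overhead."""
    return (
        weights_gb(params_b, bits_per_weight)
        + kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len)
        + activation_gb
        + overhead_gb
    )


def fits(required_gb: float, available_gb: float) -> bool:
    """Whether the estimate fits entirely in VRAM (no offload considered)."""
    return required_gb <= available_gb


# Example: an 8B model with 32 layers, 8 KV heads, and head dimension 128,
# quantized to roughly 4.5 bits per weight, with an 8192-token context.
need = estimate_vram_gb(8.0, 4.5, 32, 8, 128, 8192)
print(f"{need:.2f} GB needed, fits in 24 GB: {fits(need, 24.0)}")
```

The example shows why context length matters: at 8192 tokens the KV cache alone adds about 1 GB on top of 4.5 GB of weights.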

Caveats

  • Speed figures are planning estimates from bandwidth models, not live benchmarks; JSON output carries speed_confidence and speed_range_tok_per_sec to signal uncertainty
  • Ollama integration requires manual mapping: HuggingFace repo IDs don’t always match Ollama model names
  • The README notes benchmark snapshot dates are printed under rankings, but it’s unclear how aggressively stale data is demoted in practice
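
To see why bandwidth-based speed figures are estimates with a range rather than measurements, consider this minimal sketch of a bandwidth-bound decode model. The function name, the efficiency factor, the ±25% spread, the confidence labels, and the speed_tok_per_sec key are assumptions for illustration. Only the speed_confidence and speed_range_tok_per_sec key names come from the caveat above, and the shape of their values here is a guess.

```python
def estimate_decode_speed(
    bandwidth_gb_s: float,
    active_params_b: float,
    bits_per_weight: float,
    efficiency: float = 0.6,  # assumption: fraction of peak bandwidth achieved
    spread: float = 0.25,     # assumption: +/- uncertainty around the midpoint
) -> dict:
    """Rough tokens/sec for single-stream decoding.

    Each generated token must read the active weights once, so throughput is
    bounded by memory bandwidth divided by bytes read per token. For a MoE
    model, pass the active parameter count, not the total.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    mid = bandwidth_gb_s / gb_read_per_token * efficiency
    return {
        "speed_tok_per_sec": mid,  # hypothetical key
        "speed_range_tok_per_sec": [mid * (1 - spread), mid * (1 + spread)],
        # Assumption: label the confidence by how wide the spread is.
        "speed_confidence": "low" if spread >= 0.25 else "medium",
    }


# Example: an 8B dense model at ~4.5 bits per weight on a GPU with
# 1008 GB/s of memory bandwidth (the RTX 4090's published figure).
est = estimate_decode_speed(1008.0, 8.0, 4.5)
low, high = est["speed_range_tok_per_sec"]
print(f"~{est['speed_tok_per_sec']:.0f} tok/s (range {low:.0f} to {high:.0f})")
```

The midpoint moves linearly with the assumed efficiency factor, which is the main reason a figure like this should be read as a planning estimate.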

Verdict

Useful if you’re tired of downloading 70B models that OOM or underperform. Overkill if you already know exactly which quantized GGUF runs on your single GPU. The --gpu simulation mode is genuinely handy for pre-purchase anxiety.
