XiongjieDai/GPU-Benchmarks-on-LLM-Inference
A benchmark suite comparing inference throughput and memory capacity of various GPUs running LLaMA models via llama.cpp.

This repository benchmarks large language model inference performance across a wide range of hardware, including NVIDIA GPUs from RTX 3070 to A100, Apple Silicon (M1, M2 Ultra, M3 Max), and multi-GPU configurations. It uses llama.cpp to measure throughput (tokens/second) and to record which configurations run out of memory for LLaMA 3 models at 8B and 70B scales in two formats (Q4_K_M, F16). Results are presented as comparative tables of tokens/second for each hardware configuration.
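
To give a feel for why some model and GPU combinations run out of memory, here is a minimal sketch that estimates the size of the model weights from the parameter count and the bits stored per weight. It is not taken from the repository. The helper names (`weight_gib`, `fits`) are hypothetical, the parameter counts are the nominal 8B and 70B figures, and the Q4_K_M bits-per-weight value is an approximate assumption that varies by model. The estimate covers weights only; the KV cache, activations, and runtime overhead add to it.

```python
# Rough weight-memory estimate for a quantized model.
# Covers weights only: KV cache, activations, and runtime overhead are ignored.

BITS_PER_WEIGHT = {
    "F16": 16.0,     # two bytes per parameter
    "Q4_K_M": 4.85,  # ASSUMPTION: approximate average, varies by model
}


def weight_gib(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / 2**30


def fits(params_billion: float, quant: str, vram_gib: float) -> bool:
    """True if the weights alone are smaller than the given VRAM budget."""
    return weight_gib(params_billion, quant) < vram_gib


if __name__ == "__main__":
    # Nominal model sizes; real parameter counts differ slightly.
    for size in (8, 70):
        for quant in ("Q4_K_M", "F16"):
            print(f"{size}B {quant}: ~{weight_gib(size, quant):.1f} GiB of weights")
```

A `fits(...)` result of `True` only means the weights fit; a real run can still fail once context length and overhead are added, which is why measured out-of-memory results are more reliable than this estimate.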