
XiongjieDai/GPU-Benchmarks-on-LLM-Inference

A benchmark suite comparing inference throughput and memory capacity of various GPUs running LLaMA models via llama.cpp.

1.9k stars · Jupyter Notebook · Inference · Serving · Language Models
Velocity (7d): +1.8 ★ / day
Trend: steady

This repository benchmarks large language model inference across a wide range of hardware, including NVIDIA GPUs from the RTX 3070 to the A100, Apple Silicon (M1, M2 Ultra, M3 Max), and multi-GPU configurations. It uses llama.cpp to measure throughput (tokens per second) and to record which configurations run out of memory for LLaMA 3 models at 8B and 70B scales, in Q4_K_M quantization and F16 precision. Results are presented as comparative tables of tokens per second for each hardware configuration.
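To make the out-of-memory side of the comparison concrete, here is a rough back-of-the-envelope sketch of how much memory the model weights alone would need at each precision. It is not taken from the repository: the function names, the nominal parameter counts (8 and 70 billion), the 1 GiB overhead allowance, and the figure of roughly 4.85 bits per weight for Q4_K_M are all assumptions made for illustration. Real usage also depends on context length, KV cache, and runtime buffers.

```python
# Rough estimate of weight memory for a model at a given precision.
# All constants below are illustrative assumptions, not measured values.

GIB = 2**30  # bytes per GiB

# Assumed average bits per weight; Q4_K_M is a mixed-precision format,
# so 4.85 is only an approximation.
BITS_PER_WEIGHT = {"F16": 16.0, "Q4_K_M": 4.85}


def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def likely_fits(n_params_billion: float, fmt: str, vram_gib: float,
                overhead_gib: float = 1.0) -> bool:
    """Crude check: weights plus a fixed (hypothetical) overhead vs. available memory."""
    needed = weight_memory_gib(n_params_billion, BITS_PER_WEIGHT[fmt]) + overhead_gib
    return needed <= vram_gib


if __name__ == "__main__":
    for size in (8, 70):
        for fmt in ("Q4_K_M", "F16"):
            gib = weight_memory_gib(size, BITS_PER_WEIGHT[fmt])
            print(f"{size}B {fmt}: ~{gib:.1f} GiB of weights")
```

Under these assumptions, a 70B model in F16 needs well over 100 GiB for weights alone, which is why such a configuration cannot run on a single consumer GPU, while the 8B model in Q4_K_M is small enough for most of the listed cards.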
