baidu-research/DeepBench

Baidu's low-level stress test for AI hardware

A benchmarking suite that measures the raw ingredients of deep learning—GEMMs, convolutions, and all-reduce—rather than full models.

1.1k stars C++ LLMOps · Eval
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

DeepBench benchmarks the fundamental operations that underpin deep learning (dense matrix multiplies, convolutions, recurrent layers, and all-reduce communication) across different hardware platforms. It deliberately ignores frameworks and end-to-end models, focusing instead on the low-level kernels that hardware vendors and simulator builders actually need to optimize.

The interesting bit

The project treats deep learning performance as a decomposable problem. Rather than benchmarking “ResNet on GPU A vs GPU B,” it asks: given specific matrix dimensions and convolution parameters, which hardware and library combination wins? The README includes detailed topology diagrams for multi-GPU systems and specifies exact precision requirements, making it usable as a hardware simulator input.
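
To make the "decomposable problem" idea concrete, here is a minimal sketch of a kernel-level GEMM benchmark in Python. It is not DeepBench code: the function names are hypothetical, a naive nested-list multiply stands in for a vendor library call, and taking the best of a few runs is an assumption made here rather than the suite's documented timing policy. It only shows the shape of the question: fix (M, N, K), time the kernel, and convert the time to a throughput figure.

```python
import time

def gemm_flops(m, n, k):
    """Floating-point operations for C = A @ B, with A (m x k) and B (k x n)."""
    return 2 * m * n * k  # one multiply and one add per inner-product term

def naive_gemm(a, b):
    """Reference multiply on nested lists; a stand-in for a vendor library call."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def time_kernel(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def benchmark_gemm(m, n, k):
    """Time one GEMM problem size and report achieved GFLOP/s."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    # Clamp to a tiny positive value so a coarse timer cannot cause division by zero.
    seconds = max(time_kernel(naive_gemm, a, b), 1e-9)
    return {"shape": (m, n, k),
            "seconds": seconds,
            "gflops": gemm_flops(m, n, k) / seconds / 1e9}

if __name__ == "__main__":
    for shape in [(16, 16, 16), (32, 16, 64)]:  # illustrative sizes only
        print(benchmark_gemm(*shape))
```

Swapping `naive_gemm` for a call into a different library, while keeping the problem sizes fixed, is what turns this into a like-for-like comparison across hardware and library combinations.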

Key highlights

  • Covers training and inference with separate Excel spreadsheets defining all problem sizes (DeepBenchKernels_train.xlsx and DeepBenchKernels_inference.xlsx)
  • Tests seven training platforms (NVIDIA TitanX through P100, plus Intel Knights Landing) and six inference platforms including mobile (iPhone 6/7, Raspberry Pi 3)
  • Evaluates all-reduce using four different communication libraries (NCCL, OSU, Baidu’s own allreduce, Intel MLSL) and reports the best latency per configuration
  • Uses only vendor-supplied libraries, accepting that published faster implementations exist but aren’t what most users actually run
  • Includes detailed hardware topology schematics for 8-GPU and 10-GPU NVIDIA systems
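
For readers unfamiliar with the all-reduce operation mentioned above, the following is a small single-process simulation of a ring all-reduce, one common way to implement it. This is an illustrative sketch only: it is not taken from DeepBench, and it makes no claim about which algorithm any of the listed libraries uses for a given configuration. Each "rank" is just a Python list, and `ring_allreduce` is a hypothetical name.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) ranks.

    Every rank starts with its own equal-length vector and ends with the
    element-wise sum of all vectors.
    """
    p = len(buffers)
    n = len(buffers[0])
    # Split each vector into p contiguous chunks (some may be empty if n < p).
    bounds = [(c * n // p, (c + 1) * n // p) for c in range(p)]
    data = [list(b) for b in buffers]  # work on copies

    # Phase 1, reduce-scatter: after p - 1 steps, rank r holds the
    # fully summed chunk (r + 1) % p.
    for step in range(p - 1):
        # Snapshot all outgoing chunks first so sends happen "simultaneously".
        sends = []
        for r in range(p):
            c = (r - step) % p
            lo, hi = bounds[c]
            sends.append((c, data[r][lo:hi]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % p
            lo, _ = bounds[c]
            for i, value in enumerate(payload):
                data[dst][lo + i] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            c = (r + 1 - step) % p
            lo, hi = bounds[c]
            sends.append((c, data[r][lo:hi]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % p
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload

    return data

if __name__ == "__main__":
    ranks = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    print(ring_allreduce(ranks))
```

In this scheme each rank sends 2 × (p − 1) chunks in total, so the data volume per rank stays close to twice the vector size regardless of the number of ranks. That property is why interconnect topology matters so much for all-reduce latency.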

Caveats

  • The README is truncated mid-sentence in the 10-GPU system topology section, so that part of the documentation is incomplete
  • Recurrent layer benchmarks explicitly exclude input-to-hidden calculations and input gradients, so they measure only a subset of real recurrent layer work
  • No support for asynchronous distributed training methods in the all-reduce benchmark
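
To give a rough sense of what the recurrent-layer caveat means in practice, the sketch below counts forward-pass GEMM operations per time step and splits them into the hidden-to-hidden part and the input-to-hidden part that the caveat says is excluded. It is a back-of-the-envelope model under stated assumptions, not DeepBench code: the function names are hypothetical, the gate counts are the standard ones for each cell type, and element-wise operations and backward passes are ignored.

```python
# Number of weight matrices applied per time step for each cell type.
GATES = {"vanilla": 1, "gru": 3, "lstm": 4}

def recurrent_gemm_flops(hidden, inputs, batch, cell="vanilla"):
    """Return (hidden_to_hidden, input_to_hidden) forward GEMM FLOPs per time step."""
    g = GATES[cell]
    hidden_to_hidden = 2 * g * hidden * hidden * batch
    input_to_hidden = 2 * g * hidden * inputs * batch
    return hidden_to_hidden, input_to_hidden

def measured_fraction(hidden, inputs, batch, cell="vanilla"):
    """Share of forward GEMM work that a hidden-to-hidden-only benchmark covers."""
    h2h, i2h = recurrent_gemm_flops(hidden, inputs, batch, cell)
    return h2h / (h2h + i2h)

if __name__ == "__main__":
    # Illustrative sizes only; not taken from the DeepBench spreadsheets.
    print(measured_fraction(hidden=1024, inputs=1024, batch=16, cell="lstm"))
    print(measured_fraction(hidden=1024, inputs=256, batch=16, cell="gru"))
```

Under this model, a layer whose input width equals its hidden width has half of its forward GEMM work outside the benchmark, while a layer with a narrow input is covered more fully.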

Verdict

Hardware engineers, simulator developers, and anyone building custom AI silicon should bookmark this. If you’re choosing between cloud GPU instances based on end-to-end training cost, look elsewhere: this won’t tell you how fast your actual model trains.
