Baidu's low-level stress test for AI hardware
A benchmarking suite that measures the raw ingredients of deep learning—GEMMs, convolutions, and all-reduce—rather than full models.

What it does
DeepBench benchmarks the fundamental operations that underpin deep learning (dense matrix multiplies, convolutions, recurrent layers, and all-reduce communication) across different hardware platforms. It deliberately ignores frameworks and end-to-end models, focusing instead on the low-level kernels that hardware vendors and simulator builders actually need to optimize.
The interesting bit
The project treats deep learning performance as a decomposable problem. Rather than benchmarking “ResNet on GPU A vs GPU B,” it asks: given specific matrix dimensions and convolution parameters, which hardware and library combination wins? The README includes detailed topology diagrams for multi-GPU systems and specifies exact precision requirements, making it usable as a hardware simulator input.
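To make the "decomposable problem" idea concrete, here is a minimal sketch of how a kernel can be described purely by its shape parameters and turned into an operation count. The class names, field names, and the `achieved_tflops` helper are hypothetical and are not taken from DeepBench; the FLOP formulas are the standard multiply-add counts for a dense GEMM and a direct 2D convolution.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GemmProblem:
    """One dense matrix multiply: C (M x N) = A (M x K) @ B (K x N)."""
    m: int
    n: int
    k: int

    def flops(self) -> int:
        # Each of the M*N outputs needs K multiplies and K adds.
        return 2 * self.m * self.n * self.k


@dataclass(frozen=True)
class ConvProblem:
    """One 2D convolution, described only by its shape parameters."""
    width: int
    height: int
    channels: int
    batch: int
    filters: int
    filter_w: int
    filter_h: int
    pad: int
    stride: int

    def output_size(self) -> Tuple[int, int]:
        # Standard output-size formula for a padded, strided convolution.
        out_w = (self.width + 2 * self.pad - self.filter_w) // self.stride + 1
        out_h = (self.height + 2 * self.pad - self.filter_h) // self.stride + 1
        return out_w, out_h

    def flops(self) -> int:
        # Multiply-accumulates of a direct convolution, counted as 2 ops each.
        out_w, out_h = self.output_size()
        macs = (self.batch * self.filters * out_w * out_h
                * self.channels * self.filter_w * self.filter_h)
        return 2 * macs


def achieved_tflops(flops: int, seconds: float) -> float:
    """Convert an operation count and a measured time into TFLOP/s."""
    return flops / seconds / 1e12
```

For instance, `GemmProblem(1024, 1024, 1024).flops()` evaluates to 2,147,483,648 operations. Dividing that count by a time measured on a given hardware and library combination gives a throughput figure that can be compared across platforms for the same problem size.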
Key highlights
- Covers training and inference with separate Excel spreadsheets defining all problem sizes (DeepBenchKernels_train.xlsx and DeepBenchKernels_inference.xlsx)
- Tests seven training platforms (NVIDIA TitanX through P100, plus Intel Knights Landing) and six inference platforms including mobile (iPhone 6/7, Raspberry Pi 3)
- Evaluates all-reduce using four different communication libraries (NCCL, OSU, Baidu’s own allreduce, Intel MLSL) and reports the best latency per configuration
- Uses only vendor-supplied libraries, accepting that published faster implementations exist but aren’t what most users actually run
- Includes detailed hardware topology schematics for 8-GPU and 10-GPU NVIDIA systems
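For readers unfamiliar with the all-reduce operation mentioned in the highlights, the sketch below simulates one common algorithm, ring all-reduce, in plain Python. It is an illustration of what the collective computes (every rank ends up with the element-wise sum of all ranks' buffers), not a reproduction of any of the benchmarked libraries; the function name and the in-memory "ranks" are assumptions made for the example.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce across len(buffers) ranks arranged in a ring."""
    p = len(buffers)
    n = len(buffers[0])
    if any(len(b) != n for b in buffers):
        raise ValueError("all ranks must hold buffers of equal length")
    data = [list(b) for b in buffers]  # work on copies
    if p == 1:
        return data

    # Split each buffer into p contiguous chunks; chunk c is [bounds[c], bounds[c+1]).
    bounds = [c * n // p for c in range(p + 1)]

    def chunk(c: int) -> range:
        c %= p
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: after p-1 steps, rank r holds the full sum of chunk r+1.
    for step in range(p - 1):
        # Snapshot every outgoing message first so all sends happen "simultaneously".
        msgs = [((r - step) % p, [data[r][i] for i in chunk(r - step)])
                for r in range(p)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % p
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(p - 1):
        msgs = [((r + 1 - step) % p, [data[r][i] for i in chunk(r + 1 - step)])
                for r in range(p)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % p
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data
```

The algorithm takes 2(p-1) communication steps for p ranks, and each step moves roughly 1/p of the buffer per rank, which is why the interconnect topology between GPUs matters for the measured latency.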
Caveats
- The README is truncated mid-sentence in the 10-GPU system topology section, leaving that documentation incomplete
- Recurrent layer benchmarks explicitly exclude input-to-hidden calculations and input gradients, so they measure only a subset of real recurrent layer work
- No support for asynchronous distributed training methods in the all-reduce benchmark
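To get a feel for how much work the recurrent-layer caveat leaves out, the following sketch counts forward-pass matrix-multiply operations for a recurrent layer, split into the hidden-to-hidden part and the input-to-hidden part. It is a back-of-the-envelope model under stated assumptions: one GEMM per gate per timestep for each part, forward pass only, and a `gates` multiplier of 1 for a vanilla RNN, 3 for a GRU, or 4 for an LSTM. The function name and return keys are hypothetical.

```python
from typing import Dict


def recurrent_forward_flops(hidden: int, input_size: int, batch: int,
                            timesteps: int, gates: int = 1) -> Dict[str, float]:
    """Count forward matmul FLOPs for a recurrent layer, split by component."""
    # Hidden-to-hidden: a (gates*hidden x hidden) matrix times (hidden x batch) per step.
    hidden_to_hidden = 2 * gates * hidden * hidden * batch * timesteps
    # Input-to-hidden: a (gates*hidden x input_size) matrix times (input_size x batch) per step.
    input_to_hidden = 2 * gates * hidden * input_size * batch * timesteps
    total = hidden_to_hidden + input_to_hidden
    return {
        "hidden_to_hidden": hidden_to_hidden,
        "input_to_hidden": input_to_hidden,
        # Share of forward matmul work left if the input-to-hidden part is excluded.
        "benchmarked_fraction": hidden_to_hidden / total,
    }
```

Under these assumptions, a layer whose input size equals its hidden size has a `benchmarked_fraction` of 0.5, meaning the excluded input-to-hidden multiplies account for half of the forward matrix-multiply work. The excluded input gradients in the backward pass are not modeled here.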
Verdict
Hardware engineers, simulator developers, and anyone building custom AI silicon should bookmark this. If you’re choosing between cloud GPU instances based on end-to-end training cost, look elsewhere: this won’t tell you how fast your actual model trains.