mlcommons/training

The official ML training benchmarks, warts and all

Reference implementations for MLPerf's training suite, explicitly not optimized for real performance testing.

1.8k stars · Python · LLMOps · Eval · ML Frameworks
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

This repo houses the reference implementations for MLPerf Training benchmarks — the standardized tests hardware vendors and framework authors use to prove their systems can train big models fast. Each benchmark includes model code, a Dockerfile, dataset download instructions, and a timing script. Think of it as the starting pistol, not the finish line.
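To make the "timing script" idea concrete, here is a minimal sketch of a time-to-target measurement: train until an evaluation metric first reaches a threshold and report the elapsed wall-clock time. It is not taken from the repository. The names `time_to_target`, `train_one_epoch`, and `evaluate` are hypothetical, and real reference scripts will differ per benchmark.

```python
import time
from typing import Callable, Optional


def time_to_target(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target: float,
    max_epochs: int,
) -> Optional[float]:
    """Return wall-clock seconds until evaluate() first reaches `target`.

    Returns None if the target is not reached within `max_epochs`.
    All callables are hypothetical stand-ins for a benchmark's own code.
    """
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()
        # Evaluate after every epoch and stop as soon as the target is met.
        if evaluate() >= target:
            return time.perf_counter() - start
    return None


if __name__ == "__main__":
    # Toy "model" whose accuracy rises by 0.25 per epoch.
    state = {"accuracy": 0.0}

    def fake_train() -> None:
        state["accuracy"] += 0.25

    def fake_eval() -> float:
        return state["accuracy"]

    elapsed = time_to_target(fake_train, fake_eval, target=0.75, max_epochs=10)
    print(f"reached target: {elapsed is not None}")
```

The point of the design is that the clock stops at a quality threshold rather than after a fixed number of steps, so a faster but less accurate run does not count as a win.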

The interesting bit

The README is unusually honest: these implementations are “alpha or beta quality,” “not fully optimized,” and “not intended to be used for ‘real’ performance measurements.” The project essentially exists to be replaced — submitters are expected to bring their own optimized frameworks and hardware. It’s a spec with training wheels attached.

Key highlights

  • Covers the full MLPerf Training suite: vision (ResNet, RetinaNet, Stable Diffusion), NLP (BERT, GPT-3, Llama variants), recommendation (DLRM-DCNv2), and graph neural networks (RGAT)
  • v6.0 (deadline May 2026) adds newer heavy hitters: Flux.1 text-to-image, Llama 3.1 405B, DeepSeek-V3 MoE at 671B parameters, and GPT-OSS 20B
  • Each benchmark is containerized with Docker and includes verify_dataset.sh for sanity-checking downloads
  • Datasets are large and external: Criteo’s 3.5TB multi-hot recommendation data, LAION-400M-filtered, and the IGBH-Full graph dataset
  • Submitters can use any framework; reference implementations span PyTorch, TensorFlow, NeMo, torchtitan, PaxML/Megatron-LM, and Primus
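What verify_dataset.sh actually checks is not described here, so the following is only an illustrative sketch of one common way to sanity-check a large download: compare SHA-256 digests of files against an expected manifest. The function names and the manifest format (relative path mapped to hex digest) are assumptions made for the example, not the repository's conventions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_files(root: Path, expected: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose digest does not match.

    `expected` is a hypothetical manifest: {relative_path: sha256_hex}.
    """
    bad = []
    for relative_path, wanted in expected.items():
        candidate = root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != wanted:
            bad.append(relative_path)
    return bad
```

An empty list from `find_bad_files` means every file in the manifest is present and matches; anything else names the files to re-download.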

Caveats

  • README warns benchmarks are “rather slow or take a long time to run on the reference hardware”
  • Some README links appear malformed (broken bracket syntax in the v6.0 table for GPT-OSS and DeepSeek)
  • Dataset downloads happen outside Docker and may assume specific working directory behavior

Verdict

Grab this if you’re preparing an MLPerf submission and need the official starting point, or if you want to understand what the industry considers a representative training workload. Skip it if you’re looking for production-ready training code or expect to run meaningful hardware comparisons out of the box — the README explicitly tells you not to.
