← all repositories
hendrycks/robustness

The benchmark that made "robust" models prove it

ImageNet-C and ImageNet-P are standardized corruption and perturbation datasets that let you measure how badly your vision model fails when reality gets messy.

robustness
Velocity · 7d
+0.4
★ / day
Trend
steady
star history

What it does

This repo hosts the datasets and evaluation code for Hendrycks & Dietterich’s ICLR 2019 paper. ImageNet-C applies 15 types of corruption (noise, blur, weather, digital) at five severity levels to standard ImageNet images. ImageNet-P applies smaller perturbations over time, producing sequences that test whether models flip predictions as inputs shift gradually. There are scaled-down versions for CIFAR-10/100 and Tiny ImageNet if you lack GPU acreage.

The interesting bit

The README includes a maintained leaderboard with mean Corruption Error (mCE) scores, and it explicitly flags which methods are standalone versus cocktail approaches — a small but unusual bit of honesty in benchmark culture. The authors also nudge you to use their pre-computed JPEGs rather than generating corruptions on the fly, which keeps evaluation deterministic and comparable across papers.

Key highlights

  • ImageNet-C: 1000 classes, 15 corruption types × 5 severities, pre-rendered JPEGs for consistency
  • ImageNet-P: temporal perturbation sequences (MP4s, not GIFs) measuring flip rates and top-5 distance
  • Leaderboard tracks both standalone and combined methods separately
  • CIFAR-10-C, CIFAR-100-C, Tiny ImageNet-C/P available for faster iteration
  • Requires only Python 3+ and PyTorch 0.3+; evaluation code is minimal

Caveats

  • The repo is mostly datasets plus “some code” — the README’s phrasing, not mine — so expect glue rather than a training framework
  • PyTorch 0.3+ requirement suggests the core code hasn’t been refreshed for modern versions

Verdict

Grab this if you’re building or evaluating vision models and need a standardized way to argue that your approach actually handles real-world degradation. Skip it if you’re looking for an end-to-end training pipeline; this is a measuring stick, not a toolbox.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.