The benchmark that made "robust" models prove it
ImageNet-C and ImageNet-P are standardized corruption and perturbation datasets that let you measure how badly your vision model fails when reality gets messy.

What it does
This repo hosts the datasets and evaluation code for Hendrycks & Dietterich’s ICLR 2019 paper. ImageNet-C applies 15 types of corruption (noise, blur, weather, digital) at five severity levels to standard ImageNet images. ImageNet-P applies smaller perturbations over time, producing sequences that test whether models flip predictions as inputs shift gradually. There are scaled-down versions for CIFAR-10/100 and Tiny ImageNet if you lack GPU acreage.
The interesting bit
The README includes a maintained leaderboard with mean Corruption Error (mCE) scores, and it explicitly flags which methods are standalone versus cocktail approaches — a small but unusual bit of honesty in benchmark culture. The authors also nudge you to use their pre-computed JPEGs rather than generating corruptions on the fly, which keeps evaluation deterministic and comparable across papers.
Key highlights
- ImageNet-C: 1000 classes, 15 corruption types × 5 severities, pre-rendered JPEGs for consistency
- ImageNet-P: temporal perturbation sequences (MP4s, not GIFs) measuring flip rates and top-5 distance
- Leaderboard tracks both standalone and combined methods separately
- CIFAR-10-C, CIFAR-100-C, Tiny ImageNet-C/P available for faster iteration
- Requires only Python 3+ and PyTorch 0.3+; evaluation code is minimal
Caveats
- The repo is mostly datasets plus “some code” — the README’s phrasing, not mine — so expect glue rather than a training framework
- PyTorch 0.3+ requirement suggests the core code hasn’t been refreshed for modern versions
Verdict
Grab this if you’re building or evaluating vision models and need a standardized way to argue that your approach actually handles real-world degradation. Skip it if you’re looking for an end-to-end training pipeline; this is a measuring stick, not a toolbox.