Pre-trained scene classifiers from MIT: 365 ways to read a room
A grab-bag of CNNs trained to label where a photo was taken, not what is in it.

What it does
Places365 ships a collection of pre-trained convolutional neural networks (AlexNet, VGG, ResNet, DenseNet) that classify images into 365 scene categories—patio, food_court, beer_garden, and so on. The models were trained on the Places365-Standard dataset (~1.8 million images) and a larger Places365-Challenge set (~8 million). You get weights in Caffe, Torch, and PyTorch formats, plus two Python scripts: one for bare scene prediction, another that also emits indoor/outdoor labels, scene attributes, and a class activation map.
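To make "bare scene prediction" concrete, here is a minimal sketch of the last step such a script typically performs: turning raw network scores into ranked category probabilities. It is not the repository's code. The function names and the logit values are hypothetical, and only four of the 365 labels are used for brevity.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    return [(label, prob) for prob, label in ranked[:k]]

# Hypothetical scores for four scene categories (a real model emits 365).
labels = ["patio", "food_court", "beer_garden", "cafeteria"]
logits = [1.0, 3.0, 0.5, 2.0]

predictions = top_k(logits, labels, k=2)
for label, prob in predictions:
    print(f"{prob:.3f} -> {label}")
```

In a real pipeline the logits would come from a forward pass of one of the pre-trained networks, and the labels from the category file shipped with the weights.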
The interesting bit
The project treats “scene understanding” as distinct from object detection. A ResNet152 trained from scratch on places hits 44.82% top-1 error—useful if you need context (“this is a cafeteria”) rather than bounding boxes (“there is a chair”). The unified demo script bundles category, attribute, and CAM outputs in one go, which is more plumbing than most model zoos bother with.
Key highlights
- Eight model architectures available, including hybrid models trained on ImageNet + Places365 (1,365 categories total)
- PyTorch models provided, though trained with Python 2.7 + PyTorch 0.2; a GitHub issue warns of format gotchas
- Indoor/outdoor labels and scene attribute predictions included via the unified script
- Training script (train_placesCNN.py) and easy-format dataset tar provided if you want to retrain
- CC BY license; citation to a 2017 IEEE TPAMI paper required
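One plausible way an indoor/outdoor label can be derived from scene predictions is a vote over the top-ranked categories. The sketch below assumes each category carries a 0 (indoor) or 1 (outdoor) flag and that the top predictions are averaged against a 0.5 threshold. The flag values, the `top_n` default, and the function name are all assumptions for illustration, not confirmed details of the unified script.

```python
def indoor_outdoor(ranked_categories, io_flags, top_n=10):
    """Vote indoor vs. outdoor using the flags of the top-ranked categories.

    ranked_categories: category names, best prediction first.
    io_flags: mapping of category name -> 0 (indoor) or 1 (outdoor).
    """
    votes = [io_flags[name] for name in ranked_categories[:top_n]]
    if not votes:
        raise ValueError("need at least one ranked category")
    outdoor_share = sum(votes) / len(votes)
    return "outdoor" if outdoor_share >= 0.5 else "indoor"

# Hypothetical flags; a real table would cover all 365 categories.
io_flags = {"patio": 1, "beer_garden": 1, "food_court": 0, "cafeteria": 0}

result = indoor_outdoor(["beer_garden", "patio", "food_court"], io_flags)
print(result)
```

Voting over several predictions rather than only the top one makes the label less sensitive to a single misclassification between visually similar scenes.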
Caveats
- Several README notes are stuck in 2016: “ResidualNet’s performance will be updated soon,” and the PyTorch models target a long-EOL stack
- The Caffe/Torch heritage means you may spend time translating prototxts or wrestling with loadcaffe scale mismatches (0–255 vs 0–1)
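The scale mismatch is easy to illustrate: the same image can be stored as 0–255 values or as 0–1 values, and feeding one to a model that expects the other silently degrades predictions. The helpers below are a hypothetical sketch in plain Python. Which range a given converted model expects is something to verify against its own preprocessing code.

```python
def to_unit_range(pixels):
    """Scale 0-255 pixel values down to the 0-1 range."""
    return [p / 255.0 for p in pixels]

def to_byte_range(pixels):
    """Scale 0-1 pixel values up to the 0-255 range."""
    return [p * 255.0 for p in pixels]

def guess_scale(pixels):
    """Heuristic only: any value above 1 suggests 0-255 data."""
    return "0-255" if max(pixels) > 1.0 else "0-1"

# A hypothetical row of pixel intensities.
row = [0, 128, 255]
unit = to_unit_range(row)

print(guess_scale(row), guess_scale(unit))
```

A quick range check like `guess_scale` before inference is cheap insurance when mixing weights converted between frameworks.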
Verdict
Worth a look if you need off-the-shelf scene context for legacy pipelines or research baselines. Skip it if you want modern, maintained models—today you’d probably fine-tune a CLIP or SigLIP variant instead.