Pretrained 3D ResNet: drop in a video, get action labels
A straightforward inference wrapper for spatiotemporal CNNs trained on 400 human actions.

What it does
Feed it a video folder and a pretrained 3D ResNet (ResNet-34 or ResNeXt-101) and it spits out JSON: either class scores across 400 Kinetics action categories, or 512-dim feature vectors, computed every 16 frames. There’s also a small visualization script to overlay predictions back onto the source video.
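To make "every 16 frames" concrete, here is a minimal sketch of how a frame count maps onto 16-frame clips. The function name `clip_ranges` is hypothetical, and dropping a trailing partial clip is an assumption; the tool itself may pad or handle leftover frames differently.

```python
def clip_ranges(num_frames, clip_len=16):
    """Return (start, end) frame index pairs for non-overlapping clips.

    Assumption: a trailing partial clip is dropped. The actual tool may
    pad it or treat it differently.
    """
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]


# A 50-frame video yields three full clips; the last 2 frames are unused here.
print(clip_ranges(50))
```

Under these assumptions, a video produces one score or feature vector per clip, so output size grows linearly with video length.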
The interesting bit
The project is essentially a clean inference harness around the author’s earlier training codebase. The value isn’t novelty; it’s convenience. You don’t retrain; you download weights, point at ~/videos, and run. The 2017 paper’s question, “Can 3D CNNs retrace 2D CNNs’ history?”, is answered here with a pragmatic “yes, and here’s the tool.”
Key highlights
- Pretrained on Kinetics-400 (400 action classes)
- Two modes: score (class predictions) or feature (512-dim embeddings post-global average pooling)
- Supports ResNeXt-101, which the authors note performed best
- Includes a result visualization script
- Companion Lua/Torch version exists for the historically inclined
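In score mode you get one score vector per clip, so a video-level label usually means aggregating across clips. The sketch below averages per-clip scores and picks the top class. The JSON layout (`clips`, `segment`, `scores`), the helper `video_level_prediction`, and the three placeholder labels are all assumptions for illustration; the README does not document the real schema.

```python
import json

# Hypothetical output for one video; the real schema is not documented.
sample = json.loads("""
{
  "video": "example.mp4",
  "clips": [
    {"segment": [1, 16], "scores": [0.1, 2.0, 0.3]},
    {"segment": [17, 32], "scores": [0.2, 1.0, 1.5]}
  ]
}
""")


def video_level_prediction(result, class_names):
    """Average per-clip scores and return the top class name and its mean score."""
    clips = result["clips"]
    num_classes = len(class_names)
    means = [sum(clip["scores"][i] for clip in clips) / len(clips)
             for i in range(num_classes)]
    best = max(range(num_classes), key=means.__getitem__)
    return class_names[best], means[best]


# Three placeholder labels stand in for the 400 Kinetics classes.
label, score = video_level_prediction(sample, ["archery", "bowling", "juggling"])
print(label, score)
```

Averaging is only one possible choice; taking the per-class maximum or keeping per-clip labels may suit videos that contain several actions.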
Caveats
- Setup instructions reference PyTorch 0.x-era conda channels (soumith, cuda80) and FFmpeg 3.3.3; expect to adapt for modern environments
- The README is sparse on input format specifics: resolution, codec compatibility, and the exact JSON schema are left unstated
- No mention of GPU memory requirements or batching behavior for long videos
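Since batching behavior is undocumented, one defensive approach for long videos is to group clips into fixed-size batches yourself, so memory use depends on the batch size and not on video length. This is a generic sketch, not the tool's own logic; `clip_batches` and the batch size of 4 are hypothetical.

```python
def clip_batches(clip_starts, batch_size):
    """Yield successive lists of at most batch_size clip start indices."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(clip_starts), batch_size):
        yield clip_starts[i:i + batch_size]


# Ten clips starting at frames 0, 16, ..., 144, processed 4 at a time.
clip_starts = list(range(0, 160, 16))
batches = list(clip_batches(clip_starts, 4))
print(batches)
```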
Verdict
Useful if you need quick, off-the-shelf action recognition or video feature extraction without building a pipeline from scratch. Skip if you need fine-grained temporal modeling, custom classes, or production-grade robustness: this is research code with research-code edges.