harvitronix/five-video-classification-methods

Five ways to teach a neural network what happens in a video

A 2017-era reference implementation comparing ConvNet, LSTM, 3D CNN, and MLP approaches to video classification, all wired up for the UCF101 dataset.

1.2k stars · Python · Computer Vision · ML Frameworks
Velocity (7d): +0.4 ★ / day · Trend: steady

What it does

This repo implements five classic architectures for classifying human actions in video, using the UCF101 dataset of 101 action categories. You get frame-by-frame CNN classification, CNN-to-LSTM pipelines (both two-stage and end-to-end LRCN), CNN-to-MLP, and 3D convolutional networks. It’s essentially a working notebook made public: run the data extraction scripts, wait eight hours for feature extraction on a mid-tier GPU, then train your pick of models.

The interesting bit

The value is in the side-by-side comparison, not novelty. The author wired up five approaches that were standard circa 2017 so you can see how they differ in complexity and plumbing — from “just pretend video is a stack of images” to “actually model spatiotemporal cubes with 3D convolutions.” The LRCN variant (time-distributed CNN feeding an RNN in one network) is the most architecturally elegant of the bunch.
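
One way to see the difference in plumbing is to compare the shape of a single training example each approach consumes. The sketch below is only an illustration: the sequence length, frame size, feature length, and approach names are assumptions chosen for the example, not values taken from the repo.

```python
from math import prod

# Illustrative values only; none of these numbers come from the write-up above.
SEQ_LEN = 40            # frames sampled per clip (assumed)
FRAME = (224, 224, 3)   # height, width, channels of one RGB frame (assumed)
FEATURES = 2048         # length of one per-frame CNN feature vector (assumed)


def input_shape(approach: str) -> tuple:
    """Return the shape of ONE training example for a given approach."""
    shapes = {
        # Each frame is classified on its own, so time is ignored entirely.
        "frame_cnn": FRAME,
        # Two-stage CNN-to-LSTM: a sequence of pre-extracted feature vectors.
        "cnn_lstm": (SEQ_LEN, FEATURES),
        # CNN-to-MLP: the same features, flattened into one long vector.
        "cnn_mlp": (SEQ_LEN * FEATURES,),
        # End-to-end LRCN: raw frames, with the CNN applied at every time step.
        "lrcn": (SEQ_LEN, *FRAME),
        # 3D CNN: raw frames treated as one spatiotemporal volume.
        "conv3d": (SEQ_LEN, *FRAME),
    }
    return shapes[approach]


for name in ("frame_cnn", "cnn_lstm", "cnn_mlp", "lrcn", "conv3d"):
    shape = input_shape(name)
    print(f"{name:10s} shape={shape} values={prod(shape):,}")
```

Under these assumptions the LRCN and 3D CNN see identical inputs and differ only in how the network mixes space and time, while the feature-based models work on far smaller inputs because the CNN has already compressed each frame.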

Key highlights

  • Five model architectures defined in a single models.py for easy comparison
  • Full data pipeline from raw UCF101 videos to frame sequences and CSV manifests
  • Feature extraction cached to disk so LSTM/MLP training doesn’t re-run CNN forward passes
  • TensorBoard and CSV logging built in
  • Multi-worker support in the data generator (marked as done on the repo's TODO list)
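
The feature-caching highlight can be sketched as a check-then-compute helper. This is a minimal illustration, not the repo's code: `cached_features`, `fake_extract`, and the JSON file layout are hypothetical, and a real pipeline would more likely store arrays in a binary format.

```python
import json
import tempfile
from pathlib import Path


def cached_features(video_id, extract, cache_dir):
    """Return features for video_id, computing them only on a cache miss."""
    path = Path(cache_dir) / f"{video_id}.json"
    if path.exists():
        # Cache hit: skip the expensive forward pass entirely.
        return json.loads(path.read_text())
    features = extract(video_id)
    path.write_text(json.dumps(features))
    return features


calls = []


def fake_extract(video_id):
    # Hypothetical stand-in for a CNN forward pass over every sampled frame.
    calls.append(video_id)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_features("clip_001", fake_extract, tmp)
    second = cached_features("clip_001", fake_extract, tmp)

print(f"extractor ran {len(calls)} time(s) for 2 requests")
```

The second request is served from disk, which is why training the LSTM or MLP repeatedly does not repeat the slow extraction step.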

Caveats

  • No demo script: you cannot point a finished model at a new video and get a prediction without writing code yourself
  • Locked to Keras 2 and TensorFlow 1.x — this is legacy stack territory now
  • Requires ffmpeg and manual path tweaking on non-Unix systems
  • Data augmentation and optical flow are on the TODO list, not implemented
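
On the path-tweaking caveat: one way to keep frame extraction portable is to build the ffmpeg call from `pathlib` paths and pass it as an argument list. The sketch below only constructs the command and does not run it. The helper name, the directory layout, and the `-%04d.jpg` naming pattern are assumptions for illustration, not necessarily what the repo's scripts use.

```python
from pathlib import Path


def ffmpeg_frame_command(video, out_dir):
    """Build an ffmpeg argument list that writes one JPEG per frame."""
    video = Path(video)
    # ffmpeg expands %04d to a zero-padded frame counter (0001, 0002, ...).
    pattern = Path(out_dir) / f"{video.stem}-%04d.jpg"
    return ["ffmpeg", "-i", str(video), str(pattern)]


# Hypothetical layout: data/train/<class name>/<clip>.avi
clip = Path("data") / "train" / "ApplyEyeMakeup" / "v_ApplyEyeMakeup_g01_c01.avi"
cmd = ffmpeg_frame_command(clip, Path("frames"))
print(cmd)

# To actually extract frames, pass the list to subprocess.run(cmd, check=True).
# A list avoids shell quoting problems with spaces or backslashes in paths.
```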

Verdict

Worth a look if you’re teaching computer vision or need a baseline to beat on UCF101. Skip it if you want production-ready video understanding — modern transformers and pre-trained video backbones have left this approach behind, and the dependency stack shows its age.
