Hand gestures meet 2017-era Keras in this time-capsule CNN project
A complete webcam-to-gesture pipeline that shows its age in the dependencies but still demonstrates how to wire OpenCV preprocessing to a trainable CNN.

What it does
CNNGestureRecognizer is a Python application that captures hand gestures via webcam, preprocesses the frames with OpenCV, and classifies them into one of five categories: OK, PEACE, STOP, PUNCH, or NOTHING. It ships with 4,015 training images, pretrained weights, and a small Tkinter UI for switching between prediction, retraining, and layer visualization modes.
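Classifying a frame into one of five categories comes down to reading off the largest entry of the network's five-way probability vector. The same vector drives the probability chart and the JSON dump. The sketch below is a minimal, hypothetical illustration of those steps, not the project's code. The label order, the function names (`decode_prediction`, `text_bars`, `to_json`), and the JSON layout are all assumptions.

```python
import json

# Assumed label order; the real model's class indices may differ.
GESTURES = ["OK", "PEACE", "STOP", "PUNCH", "NOTHING"]


def decode_prediction(probs):
    """Return (label, confidence) for a five-way probability vector."""
    if len(probs) != len(GESTURES):
        raise ValueError(f"expected {len(GESTURES)} probabilities, got {len(probs)}")
    best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    return GESTURES[best], probs[best]


def text_bars(probs, width=20):
    """Render one ASCII bar per class, a stand-in for an in-app bar chart."""
    lines = []
    for name, p in zip(GESTURES, probs):
        bar = "#" * round(p * width)
        lines.append(f"{name:<8}{bar:<{width}} {p:.2f}")
    return "\n".join(lines)


def to_json(probs):
    """Serialize per-class probabilities for external plotting (hypothetical schema)."""
    return json.dumps({name: round(float(p), 4) for name, p in zip(GESTURES, probs)})


if __name__ == "__main__":
    probs = [0.05, 0.80, 0.10, 0.03, 0.02]
    print(decode_prediction(probs))
    print(text_bars(probs))
    print(to_json(probs))
```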
The interesting bit
The author treats this as an educational scaffold rather than a product. You can peek inside the model’s “thinking” with built-in feature-map visualization, and the README walks through exactly how the OpenCV preprocessing works: two different pipelines (binary threshold vs. skin-color masking) for different lighting conditions. There’s even a Chrome Dino game hookup buried in the truncated conclusion.
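The following NumPy-only sketch makes the two pipelines concrete. For Binary Mode it runs grayscale, then blur, then threshold. For SkinMask Mode it applies an HSV range mask. It replaces the OpenCV calls (`cv2.cvtColor`, `cv2.GaussianBlur`, `cv2.threshold`, `cv2.inRange`) with plain array math so the logic is visible. The 3x3 box blur, the fixed threshold of 127, and the HSV bounds are illustrative assumptions, not the project's tuned values.

```python
import numpy as np

# NOTE: frames from cv2.VideoCapture are BGR; this sketch assumes RGB input.


def to_gray(rgb):
    """Grayscale via the standard BT.601 luma weights."""
    w = np.array([0.299, 0.587, 0.114])
    return np.round(rgb.astype(np.float64) @ w).astype(np.uint8)


def box_blur(gray, k=3):
    """k x k mean filter with edge padding (a simpler stand-in for GaussianBlur)."""
    pad = k // 2
    p = np.pad(gray.astype(np.float64), pad, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + h, dx:dx + w]
    return np.round(out / (k * k)).astype(np.uint8)


def binary_mode(rgb, thresh=127):
    """Binary Mode sketch: gray -> blur -> fixed threshold to {0, 255}."""
    blurred = box_blur(to_gray(rgb))
    return np.where(blurred > thresh, 255, 0).astype(np.uint8)


def rgb_to_hsv8(rgb):
    """RGB uint8 -> HSV in OpenCV's 8-bit ranges (H 0-179, S and V 0-255)."""
    x = rgb.astype(np.float64) / 255.0
    r, g, b = x[..., 0], x[..., 1], x[..., 2]
    v = x.max(axis=-1)
    c = v - x.min(axis=-1)                 # chroma
    safe_c = np.where(c > 0, c, 1.0)       # avoid divide-by-zero for grays
    s = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)
    h = np.where(v == r, ((g - b) / safe_c) % 6,
                 np.where(v == g, (b - r) / safe_c + 2, (r - g) / safe_c + 4))
    h = np.where(c > 0, h * 60.0, 0.0)     # hue in degrees, [0, 360)
    hsv = np.stack([np.floor(h / 2), s * 255, v * 255], axis=-1)
    return np.round(hsv).astype(np.uint8)


def skin_mask(rgb, lower=(0, 50, 80), upper=(30, 200, 255)):
    """SkinMask sketch: 255 where HSV lies in [lower, upper], inclusive like cv2.inRange."""
    hsv = rgb_to_hsv8(rgb)
    inside = np.all((hsv >= np.array(lower)) & (hsv <= np.array(upper)), axis=-1)
    return np.where(inside, 255, 0).astype(np.uint8)


def skin_mode(rgb):
    """Keep grayscale intensities only where the skin mask is set."""
    return np.where(skin_mask(rgb) == 255, to_gray(rgb), 0).astype(np.uint8)
```

In a real OpenCV pipeline each of these functions collapses to one or two library calls. The sketch is only meant to show the order of operations and why the skin-color path depends on lighting: the HSV bounds are fixed, so a color cast shifts hands out of range.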
Key highlights
- Pretrained on 4,015 self-collected images (803 per class) for 15 epochs; weights split by OS due to backend serialization quirks
- Two capture modes: Binary Mode for clean backgrounds, SkinMask Mode for HSV-based skin detection when lighting is good
- Real-time prediction with an in-app probability bar chart; can dump results to JSON for external plotting
- Layer visualization via Keras backend functions: see which filters activate on your own gesture images
- Includes hooks to retrain or extend the model with custom gestures
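The layer-visualization feature reads intermediate activations out of the trained model through Keras backend functions. As a framework-free illustration of what one of those activation maps contains, the sketch below computes a single filter's feature map by hand: a "valid" cross-correlation (the unflipped-kernel operation deep-learning conv layers compute) followed by ReLU. The hand-written vertical-edge kernel `EDGE` is a hypothetical stand-in for a learned filter.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation: slide the kernel without flipping it."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=np.float64)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def feature_map(image, kernel, bias=0.0):
    """One filter's activation map: convolution plus bias, then ReLU."""
    return np.maximum(conv2d_valid(image, kernel) + bias, 0.0)


# Hypothetical vertical-edge detector; a trained CNN learns its filters instead.
EDGE = np.array([[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]], dtype=np.float64)

if __name__ == "__main__":
    img = np.zeros((6, 6))
    img[:, 3:] = 1.0  # dark left half, bright right half
    print(feature_map(img, EDGE))
```

On this 6x6 image, every row of the 4x4 map reads `[0, 3, 3, 0]`: the filter fires only in the columns straddling the dark-to-bright edge. That localized response is the kind of pattern a feature-map visualizer lets you inspect on real gesture images.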
Caveats
- Dependency stack is frozen in 2017: Python 3.6.1, Keras 2.0.2, TensorFlow 1.2.1, and Theano 0.9.0 (explicitly noted as obsolete)
- Pretrained weights are 150 MB each and hosted on Google Drive, not in the repo
- The CNN architecture is a standard MNIST-style stack; the author admits it’s “pretty common” and not novel
- OS-specific weight files suggest serialization fragility across platforms
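A parameter count makes 150 MB plausible for an MNIST-style stack, because the dense layer after flattening dominates. This back-of-the-envelope sketch assumes a hypothetical layout: 200x200 grayscale input, two 3x3 convolutions with 32 filters, 2x2 max pooling, a 128-unit dense layer, and five outputs. The real model's sizes may differ.

```python
"""Back-of-the-envelope parameter count for a hypothetical MNIST-style CNN."""


def conv_params(k: int, c_in: int, filters: int) -> int:
    # k*k*c_in weights per filter, plus one bias per filter
    return k * k * c_in * filters + filters


def dense_params(n_in: int, n_out: int) -> int:
    # full weight matrix plus one bias per output unit
    return n_in * n_out + n_out


def mnist_style_params(side=200, filters=32, k=3, pool=2, hidden=128, classes=5):
    """Per-layer parameter counts; every default here is an assumption."""
    s1 = side - k + 1   # first 'valid' conv shrinks each side by k - 1
    s2 = s1 - k + 1     # second 'valid' conv
    s3 = s2 // pool     # non-overlapping max pooling (no parameters)
    flat = s3 * s3 * filters
    return {
        "conv1": conv_params(k, 1, filters),
        "conv2": conv_params(k, filters, filters),
        "dense_hidden": dense_params(flat, hidden),
        "dense_out": dense_params(hidden, classes),
    }


if __name__ == "__main__":
    layers = mnist_style_params()
    total = sum(layers.values())
    for name, n in layers.items():
        print(f"{name:<13}{n:>12,}")
    print(f"total        {total:>12,} params, ~{total * 4 / 2**20:.1f} MiB as float32")
```

Under those assumptions the total is 39,348,325 parameters, roughly 150.1 MiB of float32 weights before any file-format overhead. More than 99.9% of that sits in `dense_hidden`, which is why an architecture this simple can still produce weight files this large.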
Verdict
Grab this if you’re teaching computer vision fundamentals or need a hackable baseline for gesture control. Skip it if you want modern MediaPipe-level accuracy without the dependency archaeology.