Teaching CNNs to read sign language, one frame at a time
A classic two-stage pipeline that extracts visual features with Inception v3, then lets an RNN figure out the temporal story.

What it does
This repo implements a sign-language gesture recognizer that processes video sequences. It slices videos into frames, retrains the final layer of Google’s Inception v3 on those frames, then feeds each frame’s softmax probabilities or raw pool-layer features into an RNN (LSTM) to classify the full gesture. The work is tied to a published paper on Argentinian Sign Language.
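The data flow of the two stages can be sketched as a pair of function calls. This is a minimal illustration under stated assumptions, not the repo's code: `encode_video`, `classify_video`, and the toy stand-ins are hypothetical names, and in the real pipeline the encoder is the retrained Inception v3 and the sequence classifier is the LSTM.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]        # stand-in for a decoded image
FeatureVector = List[float]    # stand-in for one per-frame representation


def encode_video(
    frames: Sequence[Frame],
    encode_frame: Callable[[Frame], FeatureVector],
) -> List[FeatureVector]:
    """Stage 1: turn every frame into a feature vector, preserving order."""
    return [encode_frame(frame) for frame in frames]


def classify_video(
    frames: Sequence[Frame],
    encode_frame: Callable[[Frame], FeatureVector],
    classify_sequence: Callable[[List[FeatureVector]], int],
) -> int:
    """Stage 2: hand the ordered feature sequence to a sequence classifier."""
    return classify_sequence(encode_video(frames, encode_frame))


# Toy stand-ins (assumptions): the repo uses Inception v3 and an LSTM here.
def toy_encode_frame(frame: Frame) -> FeatureVector:
    return [sum(frame) / len(frame), max(frame)]


def toy_classify_sequence(features: List[FeatureVector]) -> int:
    # Sum each feature dimension over time and return the largest one's index.
    totals = [sum(column) for column in zip(*features)]
    return max(range(len(totals)), key=totals.__getitem__)


frames = [[0.0, 1.0], [0.5, 0.5]]
predicted = classify_video(frames, toy_encode_frame, toy_classify_sequence)
```

Keeping the encoder and the sequence classifier as separate callables mirrors the modularity the repo relies on: either stage can be replaced without touching the other.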
The interesting bit
The two-stage design is deliberately modular: you choose what the RNN sees for each frame, either the output of the final classification layer or the output of the pre-classification pool layer. It’s a snapshot of how temporal video understanding was commonly tackled before end-to-end transformers took over.
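The choice between the two frame representations might look like the sketch below. It assumes the CNN exposes a pool-layer vector and a list of per-class scores for each frame; `frame_representation`, its `mode` argument, and the sample values are hypothetical and only illustrate the switch.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame's class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_representation(
    pool_features: Sequence[float],
    class_logits: Sequence[float],
    mode: str = "pool",
) -> List[float]:
    """Return the per-frame vector that will be passed to the RNN."""
    if mode == "pool":
        # Pre-classification features: wide and class-agnostic.
        return list(pool_features)
    if mode == "softmax":
        # Class probabilities: one value per gesture class.
        return softmax(class_logits)
    raise ValueError(f"unknown mode: {mode!r}")


pool = [0.0] * 2048          # placeholder pool-layer output
logits = [1.0, 2.0, 3.0]     # placeholder scores for a 3-class problem
pool_repr = frame_representation(pool, logits, mode="pool")
softmax_repr = frame_representation(pool, logits, mode="softmax")
```

The trade-off is size versus information: the softmax vector is tiny but has already committed to a per-frame class guess, while the pool vector is larger and leaves more for the RNN to interpret.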
Key highlights
- Frame extraction with optional hand-segmentation preprocessing (dataset-specific, but removable)
- Retrains the final layer of Inception v3 via TensorFlow Hub’s standard retrain script
- Two intermediate representations: 2048-dim pool vectors or n-class softmax distributions
- RNN training/evaluation scripts with pickled feature dumps as input
- Tested on a dummy 3-class dataset in Google Colab
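Feeding pickled feature dumps to an RNN usually means bringing every video to the same number of frames first. The sketch below shows one way to do that; the on-disk layout (a list of `(per-frame vectors, label)` pairs) and the names `pad_sequence` and `load_dataset` are assumptions for illustration, not the repo's actual format.

```python
import io
import pickle
from typing import BinaryIO, List, Sequence, Tuple

Sample = Tuple[List[List[float]], int]   # (per-frame vectors, class label)


def pad_sequence(
    seq: Sequence[Sequence[float]], length: int, dim: int
) -> List[List[float]]:
    """Truncate or zero-pad a sequence to exactly `length` frames."""
    padded = [list(vec) for vec in seq[:length]]
    padded.extend([0.0] * dim for _ in range(length - len(padded)))
    return padded


def load_dataset(
    fileobj: BinaryIO, length: int, dim: int
) -> Tuple[List[List[List[float]]], List[int]]:
    """Read a pickled dump and return fixed-length inputs plus labels."""
    samples: List[Sample] = pickle.load(fileobj)
    inputs = [pad_sequence(frames, length, dim) for frames, _ in samples]
    labels = [label for _, label in samples]
    return inputs, labels


# Round-trip through an in-memory buffer instead of a real file.
samples = [
    ([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]], 0),   # 3 frames, truncated to 2
    ([[0.6, 0.4]], 1),                           # 1 frame, padded to 2
]
buffer = io.BytesIO()
pickle.dump(samples, buffer)
buffer.seek(0)
inputs, labels = load_dataset(buffer, length=2, dim=2)
```

Only unpickle dumps you produced yourself, since `pickle.load` can execute arbitrary code from an untrusted file.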
Caveats
- Dependencies include tflearn, which is effectively unmaintained
- The repo calls for building OpenCV from source, on the grounds that pip’s version lacks video support
- The hand-segmentation step is hardcoded for the Argentinian dataset and needs manual removal for other data
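One way to avoid the manual removal is to make preprocessing an optional hook rather than a fixed step. The sketch below is a hypothetical restructuring, not how the repo is written: `prepare_frames` and `toy_segment` are invented names, and the thresholding stand-in is far simpler than real hand segmentation.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Frame = List[List[int]]   # stand-in for a grayscale image as rows of pixels


def prepare_frames(
    frames: Iterable[Frame],
    preprocess: Optional[Callable[[Frame], Frame]] = None,
) -> Iterator[Frame]:
    """Yield frames, applying `preprocess` only when one is supplied."""
    for frame in frames:
        yield preprocess(frame) if preprocess is not None else frame


def toy_segment(frame: Frame, threshold: int = 128) -> Frame:
    """Hypothetical stand-in for segmentation: zero out dark pixels."""
    return [[p if p >= threshold else 0 for p in row] for row in frame]


frames = [[[10, 200], [130, 90]]]
raw = list(prepare_frames(frames))                              # no segmentation
segmented = list(prepare_frames(frames, preprocess=toy_segment))
```

With this shape, a new dataset can pass its own preprocessing function or none at all, and the rest of the pipeline stays unchanged.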
Verdict
Worth a look if you’re studying classical video-classification pipelines or need a reproducible baseline for sign-language research. Skip it if you want a modern, end-to-end trainable model you can drop your own data into without surgery.