Teaching neural networks to read lips by listening
A 2017 TensorFlow implementation that matches audio and video streams using coupled 3D CNNs, with lip reading as the demo application.

What it does
This repo implements a coupled 3D convolutional neural network that learns whether a 0.3-second audio clip matches a 0.3-second video of lip motion. The visual stream processes 9 grayscale mouth-region frames (9×60×100); the audio stream processes a spectrogram cube of MFEC features and derivatives (15×40×3). A lip-tracking utility using dlib extracts mouth regions from arbitrary videos as a preprocessing step.
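The input shapes quoted above follow from simple arithmetic, sketched below. Two things here are assumptions rather than facts from the text: that the 15 audio steps come from non-overlapping 20 ms analysis windows, and that the 3 channels are the static MFEC features plus their first and second derivatives. The function and constant names are hypothetical.

```python
# Shape arithmetic for one 0.3-second audio/video pair.
CLIP_MS = 300          # clip length stated in the text (0.3 s)
VIDEO_FPS = 30         # frame rate the pipeline standardizes to
AUDIO_WINDOW_MS = 20   # ASSUMPTION: non-overlapping 20 ms analysis windows
NUM_FILTERS = 40       # MFEC filterbank channels per window
NUM_CHANNELS = 3       # ASSUMPTION: static features + first and second derivatives


def visual_shape(height=60, width=100):
    """Shape of the grayscale mouth-region cube: (frames, height, width)."""
    frames = CLIP_MS * VIDEO_FPS // 1000
    return (frames, height, width)


def audio_shape():
    """Shape of the audio feature cube: (windows, filters, channels)."""
    windows = CLIP_MS // AUDIO_WINDOW_MS
    return (windows, NUM_FILTERS, NUM_CHANNELS)


print(visual_shape())  # (9, 60, 100)
print(audio_shape())   # (15, 40, 3)
```

Integer milliseconds are used instead of fractional seconds so the divisions stay exact.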
The interesting bit
The architecture treats both modalities as spatio-temporal volumes: 3D convolutions run across stacked video frames and stacked audio spectrogram windows. The paper’s claimed edge is “online pair selection” for training, but the code implements only a simpler hard-threshold version, which the README discloses upfront.
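To make "hard-threshold" pair selection concrete, here is a minimal sketch of one plausible reading: keep every genuine (matching) pair, and keep an impostor pair only when its embedding distance falls below a fixed threshold, so that easy impostors are dropped from the update. This is an illustration under that assumption, not the repo's code. The function names, the Euclidean distance, and the threshold value are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_pairs(audio_embs, video_embs, labels, threshold=1.0):
    """Return the indices of pairs kept for training.

    labels: 1 for a genuine (matching) pair, 0 for an impostor pair.
    ASSUMED rule: genuine pairs are always kept; impostor pairs are kept
    only if their distance is below the fixed threshold.
    """
    kept = []
    for i, (a, v, y) in enumerate(zip(audio_embs, video_embs, labels)):
        if y == 1 or euclidean(a, v) < threshold:
            kept.append(i)
    return kept


audio = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
video = [[0.1, 0.0], [3.0, 4.0], [0.3, 0.4]]   # distances: 0.1, 5.0, 0.5
labels = [1, 0, 0]
print(select_pairs(audio, video, labels))  # [0, 2]
```

An adaptive scheme would move the threshold as training progresses; a fixed value like the one above is the simpler alternative.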
Key highlights
- TensorFlow 1.x-era implementation of the IEEE Access 2017 paper by Torfi et al.
- Includes VisualizeLip.py for dlib-based mouth extraction and bounding-box visualization
- Audio features rely on the author’s companion SpeechPy package
- Processing pipeline standardizes to 30 fps, extracts audio via FFmpeg
- Training and evaluation scripts are thin wrappers: train.py and test.py
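The FFmpeg steps mentioned in the list can be approximated as below. This is a sketch, not the repo's actual invocation: the helper names are hypothetical, and the mono 16 kHz WAV output is an assumption. The flags themselves (-y, -i, -r, -vn, -ac, -ar) are standard FFmpeg options.

```python
import subprocess


def resample_video_cmd(src, dst, fps=30):
    """Build an FFmpeg command that re-encodes src at a fixed frame rate."""
    return ["ffmpeg", "-y", "-i", src, "-r", str(fps), dst]


def extract_audio_cmd(src, dst_wav, sample_rate=16000):
    """Build an FFmpeg command that drops video and writes mono audio.

    ASSUMPTION: mono output at 16 kHz; adjust to match your features.
    """
    return ["ffmpeg", "-y", "-i", src, "-vn",
            "-ac", "1", "-ar", str(sample_rate), dst_wav]


def run(cmd):
    """Execute a command, raising CalledProcessError on failure."""
    subprocess.run(cmd, check=True)


# The commands are only printed here; call run(...) to execute them.
print(" ".join(resample_video_cmd("input.mp4", "input_30fps.mp4")))
print(" ".join(extract_audio_cmd("input_30fps.mp4", "input.wav")))
```

Building the argument list separately from running it keeps the commands easy to inspect and avoids shell quoting problems with unusual file names.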
Caveats
- The input pipeline is entirely bring-your-own: you must prepare your own dataset and run your own feature extraction, because the code assumes “utterance-based extracted features” already exist on disk
- The adaptive pair-selection method from the paper is not implemented — only basic hard thresholding
- README is vague on dataset specifics, hardware requirements, and how to actually wire your data into the network
Verdict
Worth a look if you’re reproducing classic audio-visual matching baselines or studying 3D CNN architectures for multimodal fusion. Skip it if you need a batteries-included lip-reading toolkit or modern PyTorch code — this is research scaffolding from the TF 1.x era, not a product.