astorfi/lip-reading-deeplearning

Teaching neural networks to read lips by listening

A 2017 TensorFlow implementation that matches audio and video streams using coupled 3D CNNs, with lip reading as the demo application.

Velocity (7d): +0.6 ★/day · Trend: steady

What it does

This repo implements a coupled 3D convolutional neural network that learns whether a 0.3-second audio clip matches a 0.3-second video of lip motion. The visual stream takes 9 grayscale mouth-region frames (0.3 seconds at 30 fps) as a 9×60×100 cube; the audio stream takes a 15×40×3 cube built from 15 temporal steps of 40 MFEC features, with the features and their first and second derivatives as the three channels. A dlib-based lip-tracking utility extracts mouth regions from arbitrary videos as a preprocessing step.
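
The two input shapes follow from simple arithmetic, sketched below. The 30 fps rate matches the processing pipeline noted under Key highlights; the 20 ms non-overlapping audio window is an assumption chosen because it yields 15 temporal steps in 0.3 seconds. The function names are illustrative and do not come from the repository.

```python
def visual_cube_shape(duration_s=0.3, fps=30, height=60, width=100):
    """Shape of the visual input: (frames, height, width)."""
    n_frames = round(duration_s * fps)  # 0.3 s at 30 fps gives 9 frames
    return (n_frames, height, width)


def audio_cube_shape(duration_s=0.3, window_s=0.02, n_features=40, n_channels=3):
    """Shape of the audio input: (temporal steps, MFEC features, channels).

    window_s is an ASSUMED non-overlapping window length, not a value
    confirmed by this page.
    """
    n_windows = round(duration_s / window_s)  # 0.3 s / 20 ms gives 15 steps
    return (n_windows, n_features, n_channels)


print(visual_cube_shape())
print(audio_cube_shape())
```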

The interesting bit

The architecture treats both modalities as spatio-temporal volumes: 3D convolutions run across stacked video frames and stacked audio spectrogram windows. The paper’s claimed edge is “online pair selection” for training — though the code only implements a simpler hard-threshold version, which the README discloses upfront.
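
To make the pair-selection idea concrete, here is a minimal sketch of one plausible reading of "hard thresholding": keep only the pairs that sit on the wrong side of a fixed distance threshold, then score them with a standard contrastive loss. The threshold rule, the margin, and all function names are assumptions for illustration; the repository's actual rule may differ.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(distance, is_genuine, margin=1.0):
    """Standard contrastive loss for one pair (margin is an assumed value)."""
    if is_genuine:
        return 0.5 * distance ** 2  # pull matching pairs together
    return 0.5 * max(0.0, margin - distance) ** 2  # push impostors apart


def select_pairs_hard_threshold(pairs, threshold):
    """Keep pairs on the wrong side of a FIXED threshold (assumed rule).

    pairs: iterable of (audio_embedding, video_embedding, is_genuine).
    Returns a list of (distance, is_genuine) for the selected pairs.
    """
    selected = []
    for audio_emb, video_emb, is_genuine in pairs:
        d = euclidean(audio_emb, video_emb)
        too_far_genuine = is_genuine and d > threshold
        too_close_impostor = (not is_genuine) and d < threshold
        if too_far_genuine or too_close_impostor:
            selected.append((d, is_genuine))
    return selected
```

An adaptive scheme would update the selection criterion during training instead of using one fixed `threshold`, which is the part the page reports as missing from the code.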

Key highlights

  • TensorFlow 1.x-era implementation of the IEEE Access 2017 paper by Torfi et al.
  • Includes VisualizeLip.py for dlib-based mouth extraction and bounding-box visualization
  • Audio features rely on the author’s companion SpeechPy package
  • Processing pipeline standardizes video to 30 fps and extracts audio via FFmpeg
  • Training and evaluation scripts are thin wrappers: train.py and test.py
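
The frame-rate and audio-extraction steps above could look roughly like the following. This is a sketch, not the repository's script: the helper names, the 16 kHz sample rate, and the mono PCM output are assumptions, while the FFmpeg options themselves (`-r`, `-vn`, `-acodec`, `-ar`, `-ac`) are standard. The functions only build argument lists and run nothing.

```python
def standardize_fps_cmd(src, dst, fps=30):
    """Build an FFmpeg command that re-encodes `src` at a fixed frame rate."""
    return ["ffmpeg", "-i", src, "-r", str(fps), dst]


def extract_audio_cmd(src, dst_wav, sample_rate=16000):
    """Build an FFmpeg command that drops video and writes mono 16-bit PCM.

    sample_rate is an ASSUMED value, not one confirmed by this page.
    """
    return [
        "ffmpeg", "-i", src,
        "-vn",                   # discard the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", str(sample_rate),
        "-ac", "1",              # mono
        dst_wav,
    ]
```

Passing either list to `subprocess.run(cmd, check=True)` would execute it, provided FFmpeg is installed and on the PATH.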

Caveats

  • The input pipeline is entirely bring-your-own: you must prepare your own dataset and run feature extraction yourself; the code assumes “utterance-based extracted features” are already available
  • The adaptive pair-selection method from the paper is not implemented — only basic hard thresholding
  • README is vague on dataset specifics, hardware requirements, and how to actually wire your data into the network
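
Because the data wiring is left to the user, one way to picture the missing step is slicing each utterance into aligned 0.3-second clips. The helper below is hypothetical and assumes non-overlapping clips, 9 video frames per clip, and 15 audio feature windows per clip; the layout the repository expects may differ.

```python
def clip_index_ranges(n_video_frames, n_audio_windows,
                      frames_per_clip=9, windows_per_clip=15):
    """Return aligned (video_range, audio_range) index pairs for one utterance.

    Each range is a half-open (start, stop) tuple. Trailing frames or windows
    that do not fill a whole clip are dropped (an assumed policy).
    """
    n_clips = min(n_video_frames // frames_per_clip,
                  n_audio_windows // windows_per_clip)
    clips = []
    for k in range(n_clips):
        v0 = k * frames_per_clip
        a0 = k * windows_per_clip
        clips.append(((v0, v0 + frames_per_clip), (a0, a0 + windows_per_clip)))
    return clips


# One second of video at 30 fps with 50 audio windows yields three clips.
for video_range, audio_range in clip_index_ranges(30, 50):
    print(video_range, audio_range)
```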

Verdict

Worth a look if you’re reproducing classic audio-visual matching baselines or studying 3D CNN architectures for multimodal fusion. Skip it if you need a batteries-included lip-reading toolkit or modern PyTorch code — this is research scaffolding from the TF 1.x era, not a product.
