Job-interview emotion decoder reads your face, voice, and words
A French Employment Agency partnership produced a Flask app that scores candidate emotions from video, audio, and text simultaneously.

What it does
This project is a real-time web app that ingests job-candidate interviews through three channels—facial video, vocal audio, and written text—and runs each through a separate deep-learning pipeline. The results are fused into an ensemble prediction served via a Flask interface. It was built in partnership with the French Employment Agency, which tells you exactly where they expected the most value: high-volume hiring.
The interesting bit
The audio pipeline uses a Time-Distributed CNN that slides a rolling window across log-mel-spectrograms, feeding local features into LSTMs for temporal context. It is the kind of architecture that sounds over-engineered until you remember recruiters do not review spectrograms by hand.
Key highlights
- Three independent modalities: text (CNN-LSTM with Word2Vec embeddings), audio (time-distributed CNN-LSTM on spectrograms), video (FER2013-trained facial model)
- Pre-trained weights and processed datasets provided via Google Drive for each modality
- Colab notebooks available for audio and video pipelines
- Flask web app with live webcam or file upload support
- Academic paper linked (Overleaf, read-only)
Caveats
- The README is enthusiastic but thin on ensemble mechanics and actual accuracy numbers; the “Accuracy_Speech” chart exists but no value is quoted
- FER2013 is explicitly called “quite challenging” with empty and mislabeled images
- The web app code lives in a separate repository, not this one
Verdict
Worth a look if you are building multimodal classifiers and want a worked example of late-fusion architecture with downloadable weights. Skip it if you need production-grade inference code or reproducible benchmarks.