Teaching machines to hear feelings, with four algorithms and a YAML file
A tidy Keras reference implementation for speech emotion recognition that swaps model architectures like Lego blocks.

What it does
This repo trains classifiers to detect emotion from audio (angry, happy, sad, surprised, and so on) using four approaches under one roof: LSTM, CNN, SVM, and MLP. It handles the full pipeline from feature extraction through training to prediction, and it ships pre-trained checkpoints so you can run inference without training a model first.
The interesting bit
The author claims a bump to ~80% accuracy from an earlier version, mostly by improving feature extraction. The real utility is the side-by-side comparison: you can pit a lightweight SVM against a CNN or LSTM using the same preprocessed features and see which architecture actually suits your dataset. It also supports OpenSMILE’s standard INTERSPEECH feature sets (up to 6,373 features in ComParE_2016), which is more thorough than the usual librosa-only tutorials.
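The comparison idea can be sketched in a few lines of plain Python. This is an illustration only, not the repo's code: the two classifiers (a nearest-centroid model and a 1-nearest-neighbour model) are simple stand-ins for the SVM, CNN, and LSTM, the feature vectors are made-up numbers, and every name here (`evaluate`, `NearestCentroid`, `OneNearestNeighbour`) is hypothetical. The point is that both models see the same features and the same train/test split.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class NearestCentroid:
    """Stand-in model: predicts the label whose mean feature vector is closest."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids, key=lambda lab: distance(f, self.centroids[lab]))
            for f in X
        ]


class OneNearestNeighbour:
    """Stand-in model: predicts the label of the single closest training sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for f in X:
            nearest = min(range(len(self.X)), key=lambda i: distance(f, self.X[i]))
            predictions.append(self.y[nearest])
        return predictions


def evaluate(models, X_train, y_train, X_test, y_test):
    """Train every model on the same split and return accuracy per model name."""
    results = {}
    for name, model in models.items():
        predictions = model.fit(X_train, y_train).predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results[name] = correct / len(y_test)
    return results


# Made-up 2-D "features"; real inputs would be extracted audio features.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = ["sad", "sad", "angry", "angry"]
X_test = [[0.15, 0.15], [0.85, 0.85]]
y_test = ["sad", "angry"]

results = evaluate(
    {"nearest_centroid": NearestCentroid(), "one_nn": OneNearestNeighbour()},
    X_train, y_train, X_test, y_test,
)
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Because the split is fixed outside the models, any difference in the reported numbers comes from the classifier rather than from the preprocessing.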
Key highlights
- Supports four model families: LSTM, CNN, SVM, and MLP, all sharing a common base class
- Feature extraction via librosa or OpenSMILE (IS09 through ComParE_2016)
- Pre-trained checkpoints available; YAML configs control the pipeline
- Built-in plotting: radar charts for prediction probabilities, waveform/spectrogram visualization, training curves
- Supports four datasets across three languages: RAVDESS and SAVEE (English), EMO-DB (German), and CASIA (Chinese)
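The shared base class and the config-driven pipeline fit together in a familiar pattern: every model exposes the same interface, and a config value picks which one to build. The sketch below shows that pattern only. The class names, the `MODEL_REGISTRY` dict, the `build_model` helper, and the config keys are all hypothetical rather than the repo's actual API, the two models are toys, and the config is a plain dict standing in for a parsed YAML file.

```python
from abc import ABC, abstractmethod
from collections import Counter


class BaseModel(ABC):
    """Common interface: every model can be trained and asked for predictions."""

    def __init__(self, **params):
        self.params = params
        self.trained = False

    @abstractmethod
    def train(self, features, labels):
        ...

    @abstractmethod
    def predict(self, features):
        ...


class MajorityClassModel(BaseModel):
    """Toy model: always predicts the most common training label."""

    def train(self, features, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        self.trained = True

    def predict(self, features):
        return [self.label for _ in features]


class ThresholdModel(BaseModel):
    """Toy model: compares the first feature against a configured threshold."""

    def train(self, features, labels):
        # Nothing to learn here; the threshold comes from the config.
        self.trained = True

    def predict(self, features):
        threshold = self.params["threshold"]
        high, low = self.params["high_label"], self.params["low_label"]
        return [high if f[0] > threshold else low for f in features]


# Hypothetical registry mapping config names to model classes.
MODEL_REGISTRY = {"majority": MajorityClassModel, "threshold": ThresholdModel}


def build_model(config):
    """Instantiate the model named in the config, passing its params through."""
    name = config["model"]
    if name not in MODEL_REGISTRY:
        raise ValueError(f"unknown model: {name!r}")
    return MODEL_REGISTRY[name](**config.get("params", {}))


# Stand-in for a parsed YAML file; the keys are assumptions.
config = {
    "model": "threshold",
    "params": {"threshold": 0.5, "high_label": "angry", "low_label": "sad"},
}

model = build_model(config)
model.train([[0.1], [0.9]], ["sad", "angry"])
predictions = model.predict([[0.2], [0.7]])
print(predictions)
```

Swapping architectures then means changing one string in the config, while the training and prediction code stays the same.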
Caveats
- The ~80% accuracy figure lacks specifics on which dataset or model achieved it
- Python 3.8 and TensorFlow 2 are pinned, with no word on whether newer versions work
- OpenSMILE integration is optional and requires manual installation
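Since OpenSMILE is an optional, manually installed dependency, a script that uses it may want to check for it and fall back to librosa. The sketch below is one way to do that, under the assumption that OpenSMILE's command-line executable, `SMILExtract`, is on the `PATH`. The repo may locate it differently (for example through a configured path), and `pick_feature_backend` is a hypothetical helper, not part of the repo.

```python
import shutil


def pick_feature_backend(preferred="opensmile", binary="SMILExtract"):
    """Return the backend to use, falling back to librosa if OpenSMILE is absent.

    Assumes the OpenSMILE executable would be discoverable on PATH.
    """
    if preferred == "opensmile" and shutil.which(binary) is None:
        return "librosa"
    return preferred


backend = pick_feature_backend()
print(f"Using feature backend: {backend}")
```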
Verdict
Worth bookmarking if you need a clean, comparable baseline for speech emotion recognition or you’re teaching the topic. Skip it if you need production-grade inference or real-time streaming: the repo is research scaffolding, not a deployed system.