A KAIST VAD toolkit that still thinks it's 2017
MATLAB meets TensorFlow 1.x in a voice activity detection research artifact that ships its own noisy Korean street recordings.

What it does
This is a research toolkit for voice activity detection — figuring out when someone is actually speaking in an audio stream. It bundles four neural classifiers (DNN, bDNN, LSTM, and an attention-based ACAM model), a custom multi-resolution cochleagram feature extractor, and two hours of real-world recordings from bus stops, construction sites, parks, and rooms around KAIST. The whole pipeline is orchestrated through MATLAB, with the actual neural networks implemented in Python using TensorFlow 1.1–1.3.
The interesting bit
The ACAM model adapts ideas from visual attention (the “recurrent attention model” for image recognition) to the audio domain — a neat cross-modal transplant that the authors published in IEEE Signal Processing Letters. The bundled dataset is genuinely unusual: recorded on a Samsung Galaxy S8 in actual Korean environments, complete with crying babies, insect chirps, and mouse clicks as “bonus” noise sources.
Key highlights
- Four classifier architectures in one toolkit, all using the same MRCG frontend
- Includes 120 minutes of annotated real-world speech with ground-truth labels (bus stop SNR: 5.6 dB, construction site: 2.05 dB — properly miserable)
- Post-processing parameters exposed for tuning specific error types (false entrance/exit, missed speech, over-segmentation)
- Python reimplementation available in a separate branch
- Presented at ICASSP 2019
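To make the post-processing bullet concrete, here is a minimal sketch of what tuning for those error types usually looks like on a stream of per-frame 0/1 decisions. It is not the toolkit's code: the function name `smooth_vad` and the parameters `min_speech` and `min_gap` are hypothetical, and the two-pass scheme is a generic smoothing approach, not necessarily what the authors implemented.

```python
def smooth_vad(frames, min_speech=3, min_gap=2):
    """Smooth a list of 0/1 frame decisions (1 = speech).

    Hypothetical parameters, not the toolkit's own names:
    min_gap    -- interior silence runs shorter than this are filled with speech
    min_speech -- speech runs shorter than this are removed
    """
    out = list(frames)  # work on a copy so the input is left untouched

    def runs(seq):
        # Yield (value, start, end_exclusive) for each run of equal values.
        start = 0
        for i in range(1, len(seq) + 1):
            if i == len(seq) or seq[i] != seq[start]:
                yield seq[start], start, i
                start = i

    # Pass 1: fill short silences between speech (reduces over-segmentation).
    for value, start, end in list(runs(out)):
        interior = start > 0 and end < len(out)
        if value == 0 and interior and end - start < min_gap:
            out[start:end] = [1] * (end - start)

    # Pass 2: drop short speech bursts (reduces false entrances).
    for value, start, end in list(runs(out)):
        if value == 1 and end - start < min_speech:
            out[start:end] = [0] * (end - start)

    return out


raw = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
print(smooth_vad(raw))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
```

The trade-off is the usual one: a larger `min_gap` merges fragments but swallows real pauses, while a larger `min_speech` suppresses spurious blips but risks missing short utterances.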
Caveats
- MATLAB 2017b dependency, which the authors have marked as “will be depreciated” (sic; they mean deprecated) since at least 2018
- TensorFlow 1.x requirement (1.1–1.3), so you’ll need to resurrect an old Python environment
- MRCG feature extraction is slow: the authors themselves describe the extraction time as “somewhat long”, and a TODO to replace it with spectrograms has sat unresolved for years
- 16 kHz sampling rate is mandatory, and the toolkit provides no resampling helper
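Since the 16 kHz requirement is easy to trip over, a small pre-flight check can save a confusing failure later. The sketch below uses only Python's standard `wave` module and assumes PCM WAV input; the helper names `check_sample_rate` and `resample_ratio` are made up for illustration and are not part of the toolkit.

```python
import wave
from fractions import Fraction

REQUIRED_RATE = 16000  # sampling rate the toolkit expects, in Hz


def check_sample_rate(path, required=REQUIRED_RATE):
    """Return (ok, actual_rate) for a PCM WAV file at `path`."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    return rate == required, rate


def resample_ratio(rate, required=REQUIRED_RATE):
    """Return the reduced (up, down) integer factors that map `rate` to `required`.

    For example, 44100 Hz -> 16000 Hz reduces to (160, 441).
    """
    ratio = Fraction(required, rate)
    return ratio.numerator, ratio.denominator
```

If the check fails, the `(up, down)` pair is the form that polyphase resamplers such as `scipy.signal.resample_poly(x, up, down)` expect, so the actual conversion can be handed to whichever audio library you already have installed.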
Verdict
Worth a look if you’re reproducing the ACAM paper or need a small, messy real-world VAD dataset for benchmarking. Everyone else should probably start with something that doesn’t require a time-traveling Python environment.