Voice fingerprints on a budget: reproducing Baidu's speaker ID
An unofficial but thorough Keras/TensorFlow port of Baidu's Deep Speaker, complete with pretrained models and a six-day training recipe on consumer GPUs.

What it does
Maps audio clips to 512-dimensional “voice fingerprints” using a ResCNN trained first with softmax, then refined with triplet loss. You feed it MFCCs from a WAV or FLAC; it returns an embedding where cosine similarity tells you whether two utterances come from the same speaker. The repo includes pretrained checkpoints, inference code, and the full training pipeline.
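To make the comparison step concrete, here is a minimal sketch of scoring two embeddings with cosine similarity. It uses plain Python and short toy vectors in place of real 512-dimensional model outputs. The function names and the 0.75 threshold are illustrative assumptions, not part of the repo; a usable threshold would have to be tuned on held-out data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float],
                 threshold: float = 0.75) -> bool:
    """Decide 'same speaker' when similarity reaches the threshold.

    The default threshold is a hypothetical placeholder.
    """
    return cosine_similarity(a, b) >= threshold


# Toy stand-ins for embeddings (real ones would have 512 values each).
utterance_1 = [0.9, 0.1, 0.3]
utterance_2 = [0.8, 0.2, 0.35]   # points in nearly the same direction
utterance_3 = [-0.2, 0.9, -0.4]  # points somewhere else entirely

print(same_speaker(utterance_1, utterance_2))
print(same_speaker(utterance_1, utterance_3))
```

Because cosine similarity ignores vector length, only the direction of the embedding matters, which is why a single scalar threshold is enough to turn the score into a verification decision.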
The interesting bit
The two-stage training mirrors the original paper’s philosophy but is honest about the hardware reality: ~6 days on a GTX 1070/1080 Ti, 300 GB SSD scratch space, and 32 GB RAM plus swap. The author also ships a Chinese cloud mirror for the pretrained model, which is practical rather than performative.
Key highlights
- Pretrained ResCNN Softmax+Triplet model: 99.7% accuracy, 2.5% EER on LibriSpeech “all” (2,484 speakers)
- Compatible with TensorFlow 2.3–2.6; inference also works on newer versions, but the evaluation scripts are pinned to 2.3
- CLI handles the full drudgery: download LibriSpeech, build MFCCs, train softmax (~3 days), train triplets (~3 days)
- Supports custom datasets if you match LibriSpeech’s directory layout and use FLAC (or convert from WAV with ffmpeg)
- Triplet loss with hard negative mining; the author notes that the training loss plateaus because hard examples stay hard
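The triplet-loss bullet above can be illustrated with a small sketch. It assumes the cosine-similarity form of the loss, max(0, s_an - s_ap + margin), and picks the "hardest" negative as the wrong-speaker embedding most similar to the anchor. The function names, the 0.1 margin, and the toy vectors are assumptions for illustration; this is not the repo's training code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.1) -> float:
    """Hinge on similarities: zero once the positive beats the negative
    by at least `margin` (an assumed value)."""
    s_ap = cosine_similarity(anchor, positive)  # same speaker
    s_an = cosine_similarity(anchor, negative)  # different speaker
    return max(0.0, s_an - s_ap + margin)


def hardest_negative(anchor: Sequence[float],
                     candidates: List[Sequence[float]]) -> Sequence[float]:
    """Pick the different-speaker embedding most similar to the anchor."""
    return max(candidates, key=lambda c: cosine_similarity(anchor, c))


# Toy 2-D embeddings standing in for real model outputs.
anchor = [1.0, 0.0]
positive = [1.0, 0.2]                 # same speaker, slightly off
negatives = [[0.0, 1.0], [1.0, 0.1]]  # the second is a near-impostor

easy_loss = triplet_loss(anchor, positive, negatives[0])
hard_loss = triplet_loss(anchor, positive, hardest_negative(anchor, negatives))
print(easy_loss, hard_loss)
```

The easy negative already satisfies the margin and contributes nothing, while the mined one still produces a positive loss. Since mining keeps selecting triplets of the second kind, a loss that stops falling on those triplets is consistent with the author's remark that hard examples stay hard.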
Caveats
- test-model evaluation breaks on TensorFlow >2.3; the README explicitly warns about this
- Performance drops on noisy data; the author recommends preprocessing with Sox to strip silence and background noise
- Training demands are substantial—this is not a “pip install and go” solution for casual experimentation
Verdict
Worth a look if you need speaker verification/identification and want a reproducible, documented baseline without enterprise tooling. Skip it if you need real-time streaming inference or a plug-and-play API; this is research code with training wheels, not a product.