philipperemy/deep-speaker

Voice fingerprints on a budget: reproducing Baidu's speaker ID

An unofficial but thorough Keras/TensorFlow port of Baidu's Deep Speaker, complete with pretrained models and a six-day training recipe on consumer GPUs.

What it does

Maps audio clips to 512-dimensional “voice fingerprints” using a ResCNN that is trained first with softmax, then refined with triplet loss. You feed it MFCCs computed from a WAV or FLAC file; it returns an embedding, and the cosine similarity between two embeddings indicates whether the two utterances come from the same speaker. The repo includes pretrained checkpoints, inference code, and the full training pipeline.
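To make the comparison step concrete, here is a minimal sketch of speaker verification by cosine similarity. It does not call the repo's code: the embeddings are random placeholders standing in for model output, and `same_speaker` and its `threshold` of 0.5 are hypothetical names and values chosen for illustration. A real threshold would have to be tuned on held-out data.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the pair as one speaker if similarity reaches the threshold.

    The default threshold is an assumption for this sketch, not a value
    taken from the repo.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Placeholder 512-dimensional embeddings (assumption: real ones would come
# from the pretrained model).
rng = random.Random(0)
enrolled = [rng.gauss(0, 1) for _ in range(512)]
# A lightly perturbed copy imitates a second utterance by the same speaker.
same = [x + rng.gauss(0, 0.1) for x in enrolled]
# An independent random vector imitates a different speaker.
other = [rng.gauss(0, 1) for _ in range(512)]

print(same_speaker(enrolled, same))
print(same_speaker(enrolled, other))
```

Because cosine similarity ignores vector length, only the direction of the embedding matters, which is why a single scalar threshold can serve as the accept/reject rule.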

The interesting bit

The two-stage training mirrors the original paper’s approach, and the author is upfront about the hardware it takes: roughly 6 days on a GTX 1070/1080 Ti, 300 GB of SSD scratch space, and 32 GB of RAM plus swap. The author also ships a Chinese cloud mirror for the pretrained model, which is practical rather than performative.

Key highlights

  • Pretrained ResCNN Softmax+Triplet model: 99.7% accuracy, 2.5% EER on LibriSpeech “all” (2,484 speakers)
  • Compatible with TensorFlow 2.3–2.6; inference also works on newer versions, but the evaluation scripts are pinned to 2.3
  • CLI handles the full drudgery: download LibriSpeech, build MFCCs, train softmax (~3 days), train triplets (~3 days)
  • Supports custom datasets if you match LibriSpeech’s directory layout and supply FLAC files (or convert WAV to FLAC with ffmpeg)
  • Triplet loss with hard negative mining; author notes the training loss plateaus because hard examples stay hard
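The triplet objective and hard negative mining mentioned in the last highlight can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: the loss is written in the cosine-similarity form max(0, s_an − s_ap + margin), the `margin` of 0.1 is an assumed value, the helper names are hypothetical, and the 2-D vectors are toy stand-ins for 512-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Zero once the positive beats the negative by at least `margin`."""
    s_ap = cosine_similarity(anchor, positive)  # same-speaker similarity
    s_an = cosine_similarity(anchor, negative)  # different-speaker similarity
    return max(0.0, s_an - s_ap + margin)


def hardest_negative(anchor, candidates):
    """Pick the other-speaker embedding most similar to the anchor."""
    return max(candidates, key=lambda c: cosine_similarity(anchor, c))


# Toy 2-D embeddings (assumption: stand-ins for real model output).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [0.95, 0.3], [-1.0, 0.0]]

hard = hardest_negative(anchor, negatives)
easy_loss = triplet_loss(anchor, positive, [-1.0, 0.0])
hard_loss = triplet_loss(anchor, positive, hard)

print(hard, easy_loss, hard_loss)
```

An easy negative already satisfies the margin and contributes zero loss, so it carries no gradient. Mining keeps selecting the negatives that still violate the margin, which is consistent with the author's note that the training loss plateaus rather than falling toward zero.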

Caveats

  • test-model evaluation breaks on TensorFlow >2.3; the README explicitly warns about this
  • Performance drops on noisy data; the author recommends preprocessing with SoX to strip silence and background noise
  • Training demands are substantial: this is not a “pip install and go” solution for casual experimentation

Verdict

Worth a look if you need speaker verification/identification and want a reproducible, documented baseline without enterprise tooling. Skip it if you need real-time streaming inference or a plug-and-play API; this is research code with training wheels, not a product.