Neural networks that learn frequency, not filter coefficients
SincNet replaces the free-form filters of a CNN's first layer with tunable band-pass filters derived from sinc functions, cutting parameters and making the learned filters interpretable.

What it does
SincNet is a CNN for raw audio that replaces the first convolutional layer’s free-form filters with parametrized sinc functions, which act as learnable band-pass filters. Instead of learning every tap of each filter kernel, the network learns only two values per filter: its low and high cutoff frequencies. The repo provides a full speaker-identification pipeline built on this idea, with a TIMIT example and training utilities.
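As a rough illustration of what "learning only the cutoffs" means, the sketch below builds one band-pass kernel in plain Python as the difference of two low-pass sinc filters, then checks its gain inside and outside the band. It is not the repo's code: the names `sinc_bandpass` and `gain_at`, the 251-tap kernel, the 16 kHz sample rate, and the Hamming window are all assumptions chosen for illustration.

```python
import math


def _ideal_lowpass(f, n):
    """Tap n of an ideal low-pass filter with normalized cutoff f.

    Equals 2f * sinc(2*pi*f*n) with sinc(x) = sin(x)/x and sinc(0) = 1.
    """
    if n == 0:
        return 2.0 * f
    return math.sin(2.0 * math.pi * f * n) / (math.pi * n)


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=251, sample_rate=16000):
    """Taps of a Hamming-windowed band-pass filter (hypothetical helper).

    The whole kernel is determined by the two cutoffs, which are the only
    values a network would have to learn for this filter.
    """
    f1 = f_low_hz / sample_rate   # normalized cutoffs, in cycles per sample
    f2 = f_high_hz / sample_rate
    half = (kernel_size - 1) // 2
    taps = []
    for i in range(kernel_size):
        n = i - half  # center the kernel on n = 0
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (kernel_size - 1))
        taps.append((_ideal_lowpass(f2, n) - _ideal_lowpass(f1, n)) * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at a single frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    re = sum(t * math.cos(w * k) for k, t in enumerate(taps))
    im = sum(t * math.sin(w * k) for k, t in enumerate(taps))
    return math.hypot(re, im)


# Illustrative cutoffs only (a telephone-like 300-3400 Hz band).
taps = sinc_bandpass(300.0, 3400.0)
print(gain_at(taps, 1000.0))  # inside the band: gain near 1
print(gain_at(taps, 6000.0))  # outside the band: gain near 0
```

In a trainable layer the two cutoffs would be parameters updated by backpropagation, and the kernel would be rebuilt from them on each forward pass; here they are fixed numbers to keep the sketch dependency-free.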
The interesting bit
The insight is old-school signal processing dressed in deep-learning clothes: by constraining the filter shape to sinc functions, you bake in the prior that audio analysis needs frequency-selective filters. The result is a compact, interpretable filter bank that the authors call “customized” to the task, with fewer parameters, less overfitting risk, and filters you can actually inspect.
Key highlights
- First layer learns only 2 parameters per filter (the low and high cutoff frequencies) vs. one per filter tap, often hundreds, in a standard CNN
- Includes complete TIMIT speaker-ID experiment with config-driven training
- SincConv_fast implementation is 50% faster than the original
- Also integrated into the broader SpeechBrain and PyTorch-Kaldi projects
- Trained TIMIT model available for download
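The parameter saving in the first highlight is simple arithmetic, sketched below. The filter count (80) and kernel length (251) are illustrative assumptions rather than values stated above, and the count assumes a single input channel with no bias terms; `first_layer_params` is a hypothetical helper name.

```python
def first_layer_params(num_filters, kernel_size):
    """Return (standard_conv_params, sinc_params) for a first audio layer.

    A standard 1-D convolution learns every tap of every filter, while a
    sinc-based layer learns two cutoff frequencies per filter regardless
    of kernel length.
    """
    standard = num_filters * kernel_size
    sinc = num_filters * 2
    return standard, sinc


# Assumed, illustrative sizes: 80 filters of 251 taps each.
standard, sinc = first_layer_params(num_filters=80, kernel_size=251)
print(standard, sinc)  # 20080 160
```

Because the sinc count does not depend on kernel length, longer kernels (and so sharper filters) come at no extra parameter cost.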
Caveats
- Code is explicitly a “showcase” — the authors note speed optimizations are missing and I/O is not cluster-friendly without local data copying
- Training on a TITAN X took ~24 hours; convergence slows and oscillates after epoch 30
- The LibriSpeech version used in the paper is “available upon request,” not bundled
Verdict
Worth studying if you care about interpretable inductive biases in audio networks or need a pedagogical example of hybrid DSP/deep learning. For production speaker recognition, the authors themselves point to SpeechBrain or PyTorch-Kaldi instead.