SpeechPy: the boring speech-processing work, bottled
A Python library that extracts MFCCs and filterbank energies so you don't have to reimplement the DSP textbook.

What it does
SpeechPy turns raw audio waveforms into the standard feature vectors that speech recognizers actually eat: MFCCs, mel-filterbank energies, and their log variants. It also handles the housekeeping: stacking frames, pre-emphasis, power spectrum computation, plus global and sliding-window cepstral mean/variance normalization (CMVN). In short, it covers the classic front-end pipeline that Kaldi provides, but in pure Python with NumPy.
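To make the housekeeping steps concrete, here is a minimal NumPy sketch of pre-emphasis, frame stacking, and power spectrum computation. It is not SpeechPy's code: the function names, the 0.97 pre-emphasis coefficient, the 512-point FFT, and the choice to drop a trailing partial frame are all assumptions made for illustration.

```python
import numpy as np


def preemphasize(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through.

    coeff=0.97 is a common textbook value, assumed here.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[:1], signal[1:] - coeff * signal[:-1])


def stack_frames(signal, sample_rate, frame_length=0.020, frame_stride=0.010):
    """Slice a 1-D signal into overlapping frames (rows), dropping any partial tail."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(frame_length * sample_rate))
    step = int(round(frame_stride * sample_rate))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // step
    # Row i holds samples [i * step, i * step + frame_len).
    idx = np.arange(frame_len)[None, :] + step * np.arange(num_frames)[:, None]
    return signal[idx]


def power_spectrum(frames, fft_points=512):
    """Per-frame periodogram: |FFT|^2 / fft_points, zero-padding each frame."""
    spectrum = np.fft.rfft(frames, n=fft_points, axis=1)
    return (np.abs(spectrum) ** 2) / fft_points


# One second of a 440 Hz tone at 16 kHz (synthetic input, no WAV file needed).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

frames = stack_frames(preemphasize(signal), sample_rate)
spec = power_spectrum(frames)
print(frames.shape, spec.shape)
```

With a 20 ms frame and 10 ms stride at 16 kHz, each frame is 320 samples and consecutive frames overlap by half. Mel filterbank energies and MFCCs are then computed from the rows of `spec`.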
The interesting bit
The library is deliberately narrow. It doesn’t train models or run inference; it just solves the “read a WAV, get a matrix” problem with sensible defaults (20 ms frames, 10 ms stride, 40 mel filters). The CMVN implementation is a nice touch, with both global and local windowed versions, since channel normalization is where a lot of tutorial code quietly falls over.
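The difference between the two CMVN flavors can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not SpeechPy's implementation: the function names, the edge-padding at utterance boundaries, the odd-window requirement, and the small `eps` guard are all choices made here.

```python
import numpy as np


def cmvn_global(features, variance_normalization=True, eps=1e-12):
    """Normalize each feature column using statistics over the whole utterance."""
    features = np.asarray(features, dtype=float)
    out = features - features.mean(axis=0)
    if variance_normalization:
        out = out / (features.std(axis=0) + eps)
    return out


def cmvn_sliding(features, win_size=301, variance_normalization=True, eps=1e-12):
    """Normalize each frame using statistics from a window centered on it.

    Boundaries are handled by repeating the first/last frame (an assumption).
    """
    features = np.asarray(features, dtype=float)
    if win_size % 2 != 1:
        raise ValueError("win_size must be odd so the window can be centered")
    half = win_size // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    out = np.empty_like(features)
    for t in range(features.shape[0]):
        window = padded[t:t + win_size]  # frames t-half .. t+half
        out[t] = features[t] - window.mean(axis=0)
        if variance_normalization:
            out[t] /= window.std(axis=0) + eps
    return out


# Synthetic "features": 200 frames x 13 coefficients with a channel-like offset.
rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=3.0, size=(200, 13))

global_norm = cmvn_global(feats)
local_norm = cmvn_sliding(feats, win_size=51)
print(global_norm.shape, local_norm.shape)
```

Global CMVN needs the whole utterance up front, while the sliding version only needs frames near the current one, which is why it suits long recordings or conditions that drift over time. Both remove a constant per-coefficient offset, which is how a fixed channel effect shows up in the cepstral domain.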
Key highlights
- MFCC, filterbank energy, and log-filterbank extraction with configurable frame length, stride, filter count, and FFT size
- Frame stacking with optional zero-padding and custom windowing
- Global and sliding-window CMVN for channel compensation
- Published in JOSS with a DOI, if citations matter to your pipeline
- pip-installable; depends on the standard SciPy/NumPy stack
Caveats
- Python 2.7, 3.4, and 3.5 are the documented/tested versions; the README hasn’t been updated for newer Pythons, so compatibility is unclear
- The project appears largely unmaintained; the last meaningful activity is several years old
- No GPU acceleration; this is CPU-bound NumPy all the way
Verdict
Good for students, researchers, or prototype pipelines that need classic acoustic features without dragging in all of Kaldi. Skip it if you want end-to-end neural features or a maintained, modern dependency.