A research-grade ASR kitchen sink that still builds with Kaldi
For when you need to compare CTC, RNN-T, and six attention variants without rewriting training code.

What it does
NeuralSP is a PyTorch toolkit for end-to-end speech recognition and language modeling. It bundles encoders (RNN, Transformer, Conformer, TDS convolution), decoders (CTC, RNN-Transducer, attention-based), and a menagerie of streaming variants under one training regime. You also get language models (RNNLM, Transformer-XL, gated CNN) and enough multi-task learning modes to make your head spin.
The interesting bit
The streaming support is unusually thorough: hard monotonic attention, MoChA, monotonic multihead attention, delay-constrained training, minimum latency training, and CTC-synchronous training. Most toolkits pick one or two. This one tracks the last several years of streaming ASR research like a bibliography come to life.
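To make the streaming idea concrete, here is a minimal sketch of the expected-alignment recurrence commonly used to train hard monotonic attention. It is an illustration in plain Python, not NeuralSP's implementation; the function name `monotonic_alignment` and its arguments are hypothetical, and the selection probabilities are assumed to be already computed by some energy function.

```python
def monotonic_alignment(p_choose, prev_alpha):
    """Expected alignment over encoder frames for one decoder step.

    p_choose[j]   : probability of stopping at frame j (assumed given).
    prev_alpha[j] : alignment distribution from the previous decoder step.
    Hypothetical helper for illustration only.
    """
    alpha = []
    reach = 0.0   # probability that the left-to-right scan reaches frame j
    prev_p = 0.0  # selection probability of the previous frame
    for p, a_prev in zip(p_choose, prev_alpha):
        # Reach frame j either by skipping frame j-1 during this step,
        # or by starting the scan here (where the previous step stopped).
        reach = (1.0 - prev_p) * reach + a_prev
        alpha.append(p * reach)
        prev_p = p
    return alpha


# First decoder step: the scan starts at frame 0.
start = [1.0, 0.0, 0.0]
soft = monotonic_alignment([0.5, 0.5, 0.5], start)
hard = monotonic_alignment([0.0, 0.0, 1.0], start)
```

The attention mass can only move forward in time, which is what makes this family usable for streaming. Any mass that `soft` does not assign corresponds to the scan running off the end without selecting a frame.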
Key highlights
- Benchmarked results on 10+ corpora (AISHELL, Librispeech, Switchboard, CSJ, WSJ, etc.) with consistent model naming
- Front-end includes SpecAugment and adaptive variants; encoders cover Conformer and TDS convolution
- Decoder fusion options run deep: shallow, cold, deep, plus internal LM estimation and forward-backward attention
- Output units span phoneme to word-char mix; multi-task learning mixes CTC, attention, and LM objectives hierarchically
- Still depends on Kaldi for tooling build and pulls in warp-ctc / warp-transducer for efficient loss computation
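As a rough illustration of the simplest fusion option above, shallow fusion adds a weighted language-model log-probability to the ASR model's score at each decoding step. The sketch below is a toy, assuming per-token log-probabilities stored in dictionaries; `shallow_fusion_step` and `lm_weight` are hypothetical names, not part of NeuralSP.

```python
import math


def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine ASR and LM scores for one decoding step.

    Both inputs map token -> log-probability over the same vocabulary.
    lm_weight is an illustrative interpolation weight, tuned in practice.
    """
    return {
        tok: asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
        for tok in asr_log_probs
    }


# Toy two-token vocabulary: the ASR model prefers "a", the LM prefers "b".
asr = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.9)}

no_lm = shallow_fusion_step(asr, lm, lm_weight=0.0)
with_lm = shallow_fusion_step(asr, lm, lm_weight=1.0)
best_no_lm = max(no_lm, key=no_lm.get)
best_with_lm = max(with_lm, key=with_lm.get)
```

With the weight at zero the ASR model's choice stands; with a large enough weight the language model can overturn it. Cold and deep fusion differ in that the LM is integrated into the network during training rather than only at decoding time.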
Caveats
- Build process requires a Kaldi path and manual tool compilation; not a pip install experience
- README lists many features but offers minimal usage guidance beyond installation
- Travis CI badge suggests testing, but coverage and current maintenance status are unclear
Verdict
Grab this if you’re reproducing streaming ASR papers or need a fair comparison across CTC/RNN-T/attention baselines. Skip it if you want a batteries-included, actively maintained framework with modern packaging; ESPnet has likely superseded much of this.