Speech recognition without the HMM baggage
Eesen strips ASR down to a single RNN and CTC loss, leaving the phonetic decision trees and GMMs behind.

What it does
Eesen is an end-to-end speech recognizer that trains one bi-directional LSTM to map audio directly to text. It drops the usual ASR scaffolding: no HMMs, no GMMs, no decision trees, and optionally no pronunciation dictionary if you model characters directly. The project follows Kaldi’s recipe conventions but replaces the acoustic modeling stack with a much leaner pipeline.
The interesting bit
The decoding is where the project gets clever. Eesen offers two paths: a WFST-based decoder that folds in lexicons and language models the old-fashioned way, and an RNN-LM decoder (TensorFlow branch) that ditches the fixed lexicon entirely. It’s a pragmatic split: use structure when you have it, learn it when you don’t.
Key highlights
- CTC loss handles alignment without forced phonetic segmentation
- GPU LSTM training with parallel utterance batching for speed
- WFST decoding integrates with standard n-gram language models
- Character-level modeling removes dictionary dependency
- Full example setups in asr_egs/ with both phoneme and character labels
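To make the first highlight concrete, here is a minimal sketch of how a CTC output is turned into a label sequence without any frame-level alignment: take the best label per frame, merge consecutive repeats, then drop the blank symbol. This is an illustration of the general CTC idea, not Eesen's code. The function names, the blank index, and the toy scores are all assumptions made up for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def ctc_greedy_decode(frame_scores, symbols, blank=BLANK):
    """Pick the highest-scoring label in each frame, then collapse the path."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    return "".join(symbols[i] for i in ctc_collapse(best_path, blank))


# Toy character inventory and per-frame scores (invented for illustration).
symbols = ["<blank>", "c", "a", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # "c"
    [0.10, 0.60, 0.20, 0.10],  # "c" again, merged with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "a"
    [0.10, 0.10, 0.10, 0.70],  # "t"
    [0.60, 0.10, 0.10, 0.20],  # blank
]

print(ctc_greedy_decode(frame_scores, symbols))  # prints "cat"
```

Six frames map to three characters with no segmentation supplied up front, which is the property the CTC highlight refers to. A blank between two identical labels keeps them distinct, so `ctc_collapse([1, 1, 0, 1])` yields `[1, 1]`. The WFST and RNN-LM decoders described above go further than this greedy pass by bringing in a lexicon or language model.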
Caveats
- The TensorFlow/RNN-LM decoder lives on a separate branch, not mainline
- Experimental results are per-example and not summarized; you’ll need to dig into each RESULTS file
- 835 stars and last major activity circa 2015–2016 suggest maintenance mode
Verdict
Worth a look if you’re teaching ASR or building a minimal baseline and want to understand what the traditional pipeline actually buys you. Skip it if you need production-grade support or modern transformer-based architectures.