srvk/eesen

Speech recognition without the HMM baggage

Eesen strips ASR down to a single RNN and CTC loss, leaving the phonetic decision trees and GMMs behind.

Velocity (7d): +0.2 ★/day · Trend: steady

What it does

Eesen is an end-to-end speech recognizer that trains one bidirectional LSTM to map audio directly to label sequences, either phonemes or characters. It drops the usual ASR scaffolding: no HMMs, no GMMs, no decision trees, and no pronunciation dictionary if you model characters directly. The project follows Kaldi’s recipe conventions but replaces the acoustic modeling stack with a much leaner pipeline.

The interesting bit

Decoding is where the project gets clever. Eesen offers two paths: a WFST-based decoder that folds in lexicons and language models the old-fashioned way, and an RNN-LM decoder (on the TensorFlow branch) that ditches the fixed lexicon entirely. It’s a pragmatic split: use structure when you have it, learn it when you don’t.

Key highlights

  • CTC loss handles alignment without forced phonetic segmentation
  • GPU LSTM training with parallel utterance batching for speed
  • WFST decoding integrates with standard n-gram language models
  • Character-level modeling removes dictionary dependency
  • Full example setups in asr_egs/ with both phoneme and character labels
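
To make the first highlight concrete, here is a minimal sketch of how CTC scores a label sequence without a fixed alignment. It is plain Python written for illustration, not Eesen's implementation; the function names, the choice of index 0 as the blank symbol, and the toy probabilities are all assumptions. The forward pass sums over every frame-level path that collapses to the target, and a brute-force enumeration is included only as a sanity check.

```python
from itertools import product

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_forward(probs, target):
    """Probability of `target` summed over all valid alignments.

    probs[t][k] is the network's probability of symbol k at frame t.
    """
    # Interleave blanks around the labels: [b, l1, b, l2, ..., b]
    ext = [BLANK]
    for lab in target:
        ext += [lab, BLANK]
    n = len(ext)

    alpha = [0.0] * n
    alpha[0] = probs[0][ext[0]]
    if n > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * n
        for s in range(n):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)


def ctc_bruteforce(probs, target):
    """Same quantity by enumerating every path (tiny inputs only)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path) == list(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


# Toy example: 3 frames, symbols {blank, 1, 2}; the values are made up.
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.2, 0.7]]
print(collapse([0, 1, 1, 0, 1, 2, 2, 0]))
print(ctc_forward(probs, [1, 2]), ctc_bruteforce(probs, [1, 2]))
```

The CTC training loss is the negative log of the forward value. Real implementations work in log space and batch utterances on the GPU, which this sketch skips for readability.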

Caveats

  • The TensorFlow/RNN-LM decoder lives on a separate branch, not mainline
  • Experimental results are per-example and not summarized; you’ll need to dig into each RESULTS file
  • 835 stars and last major activity circa 2015–2016 suggest maintenance mode

Verdict

Worth a look if you’re teaching ASR or building a minimal baseline and want to understand what the traditional pipeline actually buys you. Skip it if you need production-grade support or modern transformer-based architectures.
