Kaldi speech recognition, served over WebSockets with a side of GStreamer
A real-time speech-to-text server that streams partial transcripts as you talk, built for scaling out rather than up.

What it does
This is a Python server that takes live audio streams over WebSockets and returns speech recognition results as they arrive—partial hypotheses first, final text later. It wraps the Kaldi speech recognition toolkit inside GStreamer’s media pipeline, then splits the work across a master process and independent worker processes that can live on separate machines.
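To make the partial-then-final flow concrete, here is a minimal client-side sketch. It assumes each server message is a JSON object whose `result` carries a `final` flag and a list of `hypotheses`. That message shape and the name `collect_transcript` are illustrative assumptions, not something documented above, so check the server's own protocol description before relying on them.

```python
import json


def collect_transcript(messages):
    """Fold a stream of JSON messages into (final_segments, last_partial).

    ASSUMPTION: each message looks like
    {"result": {"final": bool, "hypotheses": [{"transcript": str}, ...]}}.
    This shape is illustrative; the real wire format may differ.
    """
    final_segments = []
    last_partial = ""
    for raw in messages:
        result = json.loads(raw).get("result")
        if not result or not result.get("hypotheses"):
            continue  # status-only or empty message: nothing to display
        text = result["hypotheses"][0]["transcript"].strip()
        if result.get("final"):
            final_segments.append(text)  # this segment is settled
            last_partial = ""
        else:
            last_partial = text  # partials are provisional, so overwrite
    return final_segments, last_partial
```

The point of the split is that a UI can repaint `last_partial` on every message while only ever appending to `final_segments`.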
The interesting bit
The architecture is deliberately old-school scalable: each worker handles one active recognition session, so you serve more concurrent users by starting more workers on any machine. No GPU clustering magic, just Unix processes and WebSockets. It also persists acoustic model adaptation state between requests, so repeat users should, in principle, get better recognition over time.
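The one-worker-per-session rule can be sketched as a small piece of master-side bookkeeping. Everything here is hypothetical: the names `WorkerPool` and `NoWorkerAvailable` are invented for illustration, and raising an error when all workers are busy is an assumption about policy, not a description of what the real master does.

```python
class NoWorkerAvailable(Exception):
    """Raised when every registered worker is already serving a session."""


class WorkerPool(object):
    """Hypothetical bookkeeping: each session gets exactly one worker."""

    def __init__(self):
        self._idle = []   # workers waiting for a session
        self._busy = {}   # session_id -> worker serving it

    def register(self, worker):
        # A newly started worker announces itself and becomes available.
        self._idle.append(worker)

    def start_session(self, session_id):
        if not self._idle:
            # ASSUMPTION: reject rather than queue when capacity is exhausted.
            raise NoWorkerAvailable(session_id)
        worker = self._idle.pop()
        self._busy[session_id] = worker
        return worker

    def end_session(self, session_id):
        # The worker returns to the idle list once its session finishes.
        self._idle.append(self._busy.pop(session_id))

    @property
    def capacity(self):
        return len(self._idle) + len(self._busy)
```

Under this model, concurrent capacity is simply the number of registered workers, which is why adding a process anywhere is the whole scaling story.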
Key highlights
- Supports both legacy GMM and newer DNN/i-vector models (nnet2/nnet3), with the DNN path requiring a separate plugin compile
- Handles arbitrarily long audio via silence-based segmentation
- Can rescore recognition lattices with larger language models for better accuracy
- Post-processing hooks let you rewrite results through external programs (e.g., words-to-numbers conversion)
- Sample clients in Python, Java, JavaScript, and Haskell; includes English and Estonian demo models
Caveats
- Requires Python 2.7, Tornado 4.x, and ws4py pinned to 0.3.2 because of a reported bug in 0.3.5
- The post-processing mechanism breaks with Tornado 5+; the changelog recommends pinning to Tornado 4.5.3
- Building Kaldi and its GStreamer plugins is “quite complicated”; a Docker image exists, but it is community-maintained
- nnet3 support added in 2016, noted as “not tested very carefully”
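Because the version constraints above are easy to violate silently, a startup guard can help. The sketch below encodes only the pins listed in the caveats; `parse_version` and `check_pins` are hypothetical helper names, and how you obtain the installed version strings is left to you.

```python
def parse_version(text):
    """Turn '4.5.3' into (4, 5, 3). Non-numeric suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def check_pins(python, tornado, ws4py):
    """Return a list of human-readable problems; empty means all pins hold."""
    problems = []
    if parse_version(python)[:2] != (2, 7):
        problems.append("Python 2.7 required, found " + python)
    if parse_version(tornado)[0] != 4:
        problems.append("Tornado 4.x required, found " + tornado)
    if parse_version(ws4py) != (0, 3, 2):
        problems.append("ws4py 0.3.2 required, found " + ws4py)
    return problems
```

For example, `check_pins("2.7.18", "4.5.3", "0.3.2")` returns an empty list, while a modern environment fails all three checks.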
Verdict
Worth a look if you need self-hosted, real-time speech recognition with explicit control over acoustic models and scaling logic. Skip it if you want managed APIs, modern Python, or whisper.cpp-style simplicity—the dependency stack here is substantial and showing its age.