alumae/kaldi-gstreamer-server

Kaldi speech recognition, served over WebSockets with a side of GStreamer

A real-time speech-to-text server that streams partial transcripts as you talk, built for scaling out rather than up.

1.1k stars · Python · Image · Video · Audio
Velocity (7d): +0.2 ★ / day · Trend: steady

What it does

This is a Python server that accepts live audio over WebSockets and streams speech recognition results back as they are produced: partial hypotheses first, final text later. It wraps the Kaldi speech recognition toolkit in a GStreamer media pipeline and splits the work between a master process and independent worker processes, which can run on separate machines.
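To make the partial-then-final flow concrete, here is a minimal client-side sketch for interpreting server messages. The JSON shape it expects (a `status` code plus a `result` object holding `hypotheses` and a `final` flag) is an assumption about the wire format, not something stated on this page, so check it against the project's README before relying on it. The names `parse_result` and `collect_transcript` are hypothetical.

```python
import json


def parse_result(message):
    """Return (transcript, is_final) for one server message, or None.

    Assumed message shape (not confirmed by this page):
    {"status": 0, "result": {"hypotheses": [{"transcript": "..."}], "final": false}}
    """
    data = json.loads(message)
    if data.get("status") != 0:
        # Assumption: a non-zero status means there is no usable result.
        return None
    result = data.get("result")
    if not result or not result.get("hypotheses"):
        return None
    # Assumption: the first hypothesis is the best one.
    transcript = result["hypotheses"][0].get("transcript", "")
    return transcript, bool(result.get("final", False))


def collect_transcript(messages):
    """Join the final segments, skipping the partials they supersede."""
    finals = []
    for message in messages:
        parsed = parse_result(message)
        if parsed is not None and parsed[1]:
            finals.append(parsed[0])
    return " ".join(finals)
```

A live client would typically display each partial as it arrives and overwrite it when the matching final result comes in; `collect_transcript` shows only the simpler case of keeping the finals.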

The interesting bit

The architecture is deliberately old-school scalable: each active recognition session occupies one worker, so handling more concurrent users means starting more workers, on any machine. No GPU clustering magic, just Unix processes and WebSockets. It also persists acoustic model adaptation state between requests, so returning users should, in principle, see recognition improve over time.
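The one-worker-per-session idea can be sketched as a toy bookkeeping model. This is an illustration only, not the project's code: the class `Master`, its method names, and the choice to return `None` when no worker is free are all assumptions made for the example.

```python
class Master(object):
    """Toy model of a master that pairs each session with one idle worker."""

    def __init__(self):
        self.idle = []   # ids of registered workers that are free
        self.busy = {}   # session id -> worker id

    def register_worker(self, worker_id):
        # Workers may run on any machine; the master only tracks their ids.
        self.idle.append(worker_id)

    def open_session(self, session_id):
        """Assign an idle worker to the session, or return None if all are busy."""
        if not self.idle:
            return None
        worker_id = self.idle.pop(0)
        self.busy[session_id] = worker_id
        return worker_id

    def close_session(self, session_id):
        # The worker becomes available again once its session ends.
        self.idle.append(self.busy.pop(session_id))
```

Under this model, capacity equals the number of registered workers, which is why adding workers is the only scaling knob.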

Key highlights

  • Supports both legacy GMM and newer DNN/i-vector models (nnet2/nnet3), with the DNN path requiring a separate plugin compile
  • Handles arbitrarily long audio via silence-based segmentation
  • Can rescore recognition lattices with larger language models for better accuracy
  • Post-processing hooks let you rewrite results through external programs (e.g., words-to-numbers conversion)
  • Sample clients in Python, Java, JavaScript, and Haskell; includes English and Estonian demo models
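As an illustration of the post-processing hook, here is a minimal sketch of a words-to-numbers filter. It assumes the external program receives one recognition result per line on standard input and writes the rewritten line to standard output; that contract is an assumption, as are the names `words_to_numbers` and `main`. It only handles English number words below one hundred and ignores punctuation attached to words.

```python
import sys

UNITS = {w: n for n, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def words_to_numbers(line):
    """Replace number words below one hundred with digits."""
    tokens = line.split()
    out = []
    i = 0
    while i < len(tokens):
        word = tokens[i].lower()
        if word in TENS:
            value = TENS[word]
            # Merge a following unit 1-9 into the tens value ("twenty" + "three").
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
            if nxt in UNITS and 1 <= UNITS[nxt] <= 9:
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(tokens[i])
        i += 1
    return " ".join(out)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Assumed contract: one result per input line, one rewritten line out.
    # readline() is used so each line is handled as soon as it arrives.
    for line in iter(stdin.readline, ""):
        stdout.write(words_to_numbers(line) + "\n")
        stdout.flush()  # flush per line so a waiting parent process is not stalled
```

To run this as a standalone filter, call `main()` under an `if __name__ == "__main__":` guard. Flushing after every line matters for any program used this way, since a buffered reply would leave the caller waiting.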

Caveats

  • Requires Python 2.7 and Tornado 4.x; ws4py must be pinned to 0.3.2 because of a reported bug in 0.3.5
  • The postprocessing mechanism breaks with Tornado 5+; changelog recommends pinning to Tornado 4.5.3
  • Building Kaldi and its GStreamer plugins is “quite complicated”; Docker image exists but is community-maintained
  • nnet3 support added in 2016, noted as “not tested very carefully”
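The version constraints above could be enforced with a small startup guard. The following is a sketch under stated assumptions: `parse_version` and `check_pins` are hypothetical helpers, version strings are assumed to be purely numeric and dot-separated, and the caller is assumed to supply the installed versions itself.

```python
def parse_version(text):
    """Turn '4.5.3' into (4, 5, 3); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def check_pins(python, tornado, ws4py):
    """Return warnings for versions outside the ranges listed in the caveats."""
    problems = []
    if parse_version(python)[:2] != (2, 7):
        problems.append("Python 2.7 is required, found " + python)
    if parse_version(tornado)[0] != 4:
        problems.append("Tornado 4.x is required (4.5.3 suggested), found " + tornado)
    if parse_version(ws4py) != (0, 3, 2):
        problems.append("ws4py 0.3.2 is expected, found " + ws4py)
    return problems
```

An empty list means the environment matches the pins; anything else is worth printing before the server starts, since the Tornado mismatch in particular only surfaces later as broken post-processing.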

Verdict

Worth a look if you need self-hosted, real-time speech recognition with explicit control over acoustic models and scaling logic. Skip it if you want managed APIs, modern Python, or whisper.cpp-style simplicity: the dependency stack here is substantial and showing its age.
