QuentinFuxa/WhisperLiveKit

Self-hosted speech-to-text that actually streams properly

Whisper isn't built for real-time chunks; this kit adds the buffering and incremental processing to make it work.

Velocity (7d): +19 ★ / day · Trend: steady

What it does

WhisperLiveKit runs a local speech-to-text server with low-latency streaming, speaker diarization, and optional real-time translation into 200 languages. It exposes OpenAI-compatible REST and Deepgram-compatible WebSocket APIs, plus a browser UI and Chrome extension for capturing web audio. You can also transcribe files or generate subtitles from the CLI without starting a server.

The interesting bit

The core problem: Whisper expects complete utterances, so naïvely feeding it short audio chunks cuts words off mid-syllable. The project layers several policies from simultaneous speech-processing research (AlignAtt SimulStreaming, LocalAgreement, plus VAD-based buffering) to commit text incrementally, only once the model is confident in it. That keeps latency down without the transcript having to retract words it has already shown.
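To make the "commit only when confident" idea concrete, here is a minimal sketch of a LocalAgreement-style policy: a word becomes final only when two consecutive hypotheses agree on it. This is an illustration, not WhisperLiveKit's code. The names `LocalAgreementBuffer`, `update`, and `common_prefix` are hypothetical, and the sketch assumes each new hypothesis is a word list that starts with the words already committed.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared leading run of words in a and b."""
    prefix: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreementBuffer:
    """Hypothetical helper: commit a word once two consecutive hypotheses agree on it."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted as final
        self.previous: List[str] = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest full hypothesis; return the newly committed words."""
        # Assumption: the hypothesis begins with the already committed words,
        # so only the part after them needs to be compared.
        tail = hypothesis[len(self.committed):]
        agreed = common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        # Whatever is still unconfirmed waits for the next hypothesis.
        self.previous = tail[len(agreed):]
        return agreed


if __name__ == "__main__":
    buf = LocalAgreementBuffer()
    for hyp in (["the", "cat"], ["the", "cat", "sat"], ["the", "cat", "sat", "on"]):
        print(buf.update(hyp), "| committed:", buf.committed)
```

The trade-off is roughly one extra decoding pass of delay per word in exchange for never showing text that the next pass would change.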

Key highlights

  • Multiple backends: standard Whisper, Mistral’s 4B Voxtral model (better auto language detection), and Qwen3-ASR via vLLM
  • Speaker diarization via Streaming Sortformer or Diart
  • Simultaneous translation via distilled NLLB (200 languages)
  • The wlk CLI handles model management, file transcription, subtitle export, and reproducible speed/accuracy benchmarks
  • Optional extras for Apple Silicon MLX, CUDA 12.9, CPU PyTorch, etc.

Caveats

  • Several feature extras are mutually incompatible and require separate virtual environments (e.g., qwen3-vllm vs cu129, voxtral-hf vs diarization-sortformer)
  • Auto language detection with Whisper biases toward English; Voxtral is recommended for multilingual use
  • The Diart diarization path still ships as an optional extra even though the README marks it “not recommended”

Verdict

Worth a look if you need self-hosted, real-time transcription with speaker labels or live translation and don’t want to pay per-minute API fees. Skip it if you just need batch transcription of complete files; plain Whisper or faster-whisper is simpler.
