Is whisperX open source?

Yes — m-bain/whisperX is open source, released under the BSD-2-Clause license.

What language is whisperX written in?

m-bain/whisperX is primarily written in Python.

How popular is whisperX?

m-bain/whisperX has 23.2k stars on GitHub and is currently holding steady.

Where can I find whisperX?

m-bain/whisperX is on GitHub at https://github.com/m-bain/whisperX.

m-bain/whisperX

Whisper, but make it actually useful for production

OpenAI's Whisper is accurate but slow and timestamp-imprecise; WhisperX bolts on batching, forced phoneme alignment, and speaker diarization to fix that.

★ 23.2k stars · Python · Image · Video · Audio

Velocity (7d): +16 ★ / day · Trend: steady

What it does

WhisperX wraps OpenAI’s Whisper with three production-oriented features: batched inference (a claimed 70× realtime with large-v2), word-level timestamps via wav2vec2 forced alignment, and speaker diarization via pyannote-audio. The result is a pipeline that transcribes, timestamps, and labels who spoke when.
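To put the claimed 70× realtime figure in concrete terms, here is a small arithmetic sketch. The helper name `processing_seconds` is made up for illustration, and the 70× value is the project's own claim for large-v2, not something measured here.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated wall-clock time to transcribe audio at a given realtime factor."""
    return audio_seconds / realtime_factor


one_hour = 60 * 60  # seconds of audio
# At the claimed 70x realtime, one hour of audio takes a little under a minute.
print(round(processing_seconds(one_hour, 70.0), 1))  # 51.4
```

Actual throughput will depend on the GPU, batch size, and audio content, so treat the result as an upper bound on what the claim implies.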

The interesting bit

The clever part isn’t any single model; it’s the assembly line. VAD preprocessing first splits the audio into speech segments, which both speeds things up and cuts Whisper’s tendency to hallucinate in silence. Whisper then transcribes those segments without timestamps so they can be batched, and wav2vec2 realigns the text to the audio at the phoneme level for precise word boundaries. It’s a case study in gluing specialist models together to compensate for each other’s weaknesses.
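The VAD step can be pictured as merging neighbouring speech segments into chunks that fit Whisper's 30-second input window, so that each chunk becomes one item in a batch. The following is a minimal sketch of that idea, not WhisperX's actual implementation: the function name `merge_vad_segments`, the greedy strategy, and the sample timestamps are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_vad_segments(segments: List[Segment], max_chunk_s: float = 30.0) -> List[Segment]:
    """Greedily merge neighbouring speech segments into chunks of at most max_chunk_s.

    Assumption: a single segment longer than max_chunk_s is kept as-is;
    this sketch does not split it.
    """
    chunks: List[Segment] = []
    for start, end in sorted(segments):
        if chunks and end - chunks[-1][0] <= max_chunk_s:
            # Extending the current chunk still fits in the window.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            # Start a new chunk at this speech segment.
            chunks.append((start, end))
    return chunks


# Hypothetical VAD output: speech regions with silence in between.
speech = [(0.0, 4.2), (5.0, 12.5), (14.0, 29.0), (31.0, 40.0), (75.0, 80.0)]
chunks = merge_vad_segments(speech)
print(chunks)  # [(0.0, 29.0), (31.0, 40.0), (75.0, 80.0)]
```

Note that the long silence between 40 s and 75 s never reaches the transcription model, which is where the reduction in silence-induced hallucination would come from in a design like this.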

Key highlights

Batched inference through the faster-whisper backend; runs large-v2 in under 8 GB of GPU memory
Word-level timestamps via wav2vec2 forced alignment, not Whisper’s coarse utterance-level timing
Speaker diarization with speaker ID labels (requires Hugging Face token for pyannote model)
VAD-based segmentation reduces hallucination and enables batching “with no WER degradation” per the paper
INTERSPEECH 2023 publication; 1st place in Ego4d transcription challenge
Supports multiple languages with automatic alignment model selection for tested languages (en, fr, de, es, it and others via Hugging Face)
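Once words carry precise timestamps and the diarizer has produced speaker turns, combining the two is mostly interval arithmetic. The sketch below assigns each word to the speaker turn it overlaps most. It illustrates the general technique under stated assumptions and is not taken from WhisperX's code: the function names, the dictionary layout, and the `SPEAKER_00`-style labels are chosen for the example.

```python
from typing import Dict, List


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Dict], turns: List[Dict]) -> List[Dict]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn get speaker None.
    """
    labelled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**word, "speaker": best_speaker})
    return labelled


# Hypothetical aligned words and diarization turns (times in seconds).
words = [
    {"word": "Hello", "start": 0.10, "end": 0.45},
    {"word": "there", "start": 0.50, "end": 0.90},
    {"word": "Hi", "start": 1.95, "end": 2.30},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0},
    {"speaker": "SPEAKER_01", "start": 2.0, "end": 4.0},
]
labelled = assign_speakers(words, turns)
for item in labelled:
    print(item["word"], item["speaker"])
```

The word "Hi" straddles the turn boundary at 2.0 s and goes to `SPEAKER_01` because most of it falls in the second turn. This is why word-level timing matters for diarization: with only coarse utterance-level timestamps, such boundary cases cannot be resolved per word.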

Caveats

Speaker diarization requires accepting a user agreement on Hugging Face and providing an access token
For untested languages, you’re on your own to find a compatible phoneme-based ASR model
--condition_on_prev_text is disabled by default to reduce hallucination, which may trade off some contextual coherence

Verdict

Anyone building meeting transcripts, subtitles, or searchable audio archives should look here instead of raw Whisper. If you just need quick one-off transcription and don’t care about word timing or speaker labels, the added complexity is probably overkill.
