openai/whisper

One model, many tongues: speech recognition that actually ships

OpenAI's Whisper replaces the usual Rube Goldberg pipeline of speech-processing tools with a single Transformer trained to do it all.

Velocity (7d): +75 ★/day · Trend: steady

What it does

Whisper is a general-purpose speech recognition system. A single Transformer sequence-to-sequence model handles multilingual transcription, speech-to-English translation, language identification, and voice activity detection, instead of chaining a separate tool for each step. You feed it audio; it emits text. It ships with a command-line tool, a Python API, and six model sizes ranging from 39M to 1.5B parameters.
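
As a rough illustration of the Python API, the sketch below wraps the two calls most users need: `whisper.load_model` and `model.transcribe`. The wrapper name `transcribe_file` and its `loader` argument are hypothetical conveniences, not part of Whisper; `loader` exists only so the function can be exercised without downloading weights.

```python
def transcribe_file(path, model_name="turbo", loader=None):
    """Return the transcript text for one audio file.

    `loader` is a hypothetical hook (not part of Whisper): any callable that
    takes a model name and returns an object with a `.transcribe(path)` method.
    """
    if loader is None:
        # Imported lazily: needs `pip install -U openai-whisper` and ffmpeg.
        import whisper
        loader = whisper.load_model
    model = loader(model_name)
    # transcribe() returns a dict that includes "text", "segments" and "language".
    result = model.transcribe(path)
    return result["text"].strip()
```

With the package installed, `transcribe_file("audio.mp3")` would load the turbo weights on first use and return the transcript as a string.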

The interesting bit

The clever part is the multitask training format. Each job (transcribe, translate, identify language) is selected by special tokens fed to the decoder, so one model covers work that traditionally required a pipeline of specialized models. The turbo variant is an optimized version of the large model (large-v3) that trades a small accuracy hit for roughly 8× the speed.
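
To make the token format concrete, here is a minimal sketch of how a decoder prompt could be assembled. The token spellings follow the convention described in the Whisper paper; the function `build_prompt` is hypothetical, and the real tokenizer works with integer token IDs rather than strings.

```python
TASKS = ("transcribe", "translate")

def build_prompt(language, task="transcribe", timestamps=True):
    """Return the special tokens that tell the decoder which job to perform.

    Illustrative only: Whisper's own tokenizer emits integer IDs, not strings.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens

print(build_prompt("fr", task="translate", timestamps=False))
```

Changing the job means changing only the task token; the model weights and the decoding loop stay the same.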

Key highlights

  • Six sizes, four with English-only variants; VRAM requirements span ~1 GB to ~10 GB
  • turbo model runs ~8× faster than large with “minimal degradation in accuracy”
  • Performance varies significantly by language; WER/CER breakdowns are published in the paper appendices
  • Code and model weights are MIT-licensed
  • Requires ffmpeg on your system; may need Rust installed for tiktoken compilation
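
The size and VRAM spread above suggests a simple selection rule: take the largest model that fits on the card. The sketch below assumes the approximate per-model figures listed in the project README at the time of writing; treat them as rough guidance and check the current README before relying on them. The helper `pick_model` is hypothetical.

```python
# (name, parameters in millions, approx. VRAM in GB, approx. speed relative to large)
# Figures are approximate and assumed from the project README; verify before use.
MODELS = [
    ("tiny", 39, 1, 10),
    ("base", 74, 1, 7),
    ("small", 244, 2, 4),
    ("medium", 769, 5, 2),
    ("large", 1550, 10, 1),
    ("turbo", 809, 6, 8),
]

def pick_model(vram_gb):
    """Return the largest model (by parameter count) that fits, or None."""
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m[1])[0]
```

Ranking by parameter count is a design choice made for this sketch; ranking by relative speed instead would favor tiny on any card.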

Caveats

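  • The turbo model is not trained for translation; it ignores --task translate and returns text in the original language. Use one of the other multilingual models (tiny, base, small, medium, large) to translate into English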
  • Real-world speed “may vary significantly” from the A100 benchmarks depending on language, speaking speed, and hardware
  • Installation can involve chasing down setuptools_rust and PATH tweaks if pre-built wheels are missing
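
Two of these caveats can be caught before any model is loaded. The following is a minimal pre-flight sketch using only the standard library; the function name `preflight` and its `which` parameter are hypothetical, with `which` exposed only so the check can be tested without touching the real PATH.

```python
import shutil

def preflight(model_name, task, which=shutil.which):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if model_name == "turbo" and task == "translate":
        problems.append("turbo is not trained for translation; pick another model")
    return problems
```

Calling `preflight("turbo", "translate")` on a machine with ffmpeg installed would report only the translation problem.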

Verdict

Worth a look if you need multilingual speech-to-text without orchestrating a stack of specialized tools. Skip it if you need real-time streaming or guaranteed low-latency inference: the 30-second sliding window and autoregressive decoding are not built for conversational-speed use cases.
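
To see why the windowing works against streaming, the sketch below slices a recording into fixed 30-second windows. It is a simplification under stated assumptions: a 16 kHz sample rate, and a fixed stride, whereas Whisper's own loop is understood to advance based on the timestamps it predicts. The name `fixed_windows` is hypothetical.

```python
SAMPLE_RATE = 16_000    # samples per second (assumed)
WINDOW_SECONDS = 30     # window length described above

def fixed_windows(n_samples):
    """Return (start, end) sample offsets for consecutive 30-second windows."""
    window = SAMPLE_RATE * WINDOW_SECONDS
    return [
        (start, min(start + window, n_samples))
        for start in range(0, n_samples, window)
    ]

# A 75-second recording yields two full windows and one partial window.
print(fixed_windows(75 * SAMPLE_RATE))
```

Each window is then decoded token by token, so latency grows with both the window length and the length of the text it produces.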
