One model, many tongues: speech recognition that actually ships
OpenAI's Whisper replaces the usual Rube Goldberg pipeline of speech-processing tools with a single Transformer trained to do it all.

What it does
Whisper is a general-purpose speech recognition system that handles multilingual transcription, speech-to-English translation, language identification, and voice activity detection. It runs as a single Transformer sequence-to-sequence model rather than chaining together separate tools for each step. You feed it audio; it emits text. There is a command-line tool, a Python API, and six model sizes ranging from 39M to 1.5B parameters.
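For orientation, here is a minimal sketch of the Python API. The two library calls (`whisper.load_model` and `model.transcribe`) follow the usage shown in the project's README; the wrapper name `transcribe_file` and the audio path are placeholders, and the sketch assumes the `openai-whisper` package and ffmpeg are already installed.

```python
def transcribe_file(path, model_name="turbo"):
    """Transcribe one audio file and return the recognized text.

    `transcribe_file` is a hypothetical convenience wrapper, not part of Whisper.
    """
    # Imported lazily so that defining this helper does not require the package.
    import whisper

    # Downloads the weights on first use, then loads them.
    model = whisper.load_model(model_name)

    # transcribe() returns a dict; the full transcript is under the "text" key.
    result = model.transcribe(path)
    return result["text"]


# Example call (needs a real audio file on disk):
# print(transcribe_file("audio.mp3"))
```

The command-line equivalent is roughly `whisper audio.mp3 --model turbo`.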
The interesting bit
The clever part is the multitask training format. The model learns to interpret special tokens that specify which job to perform (transcribe, translate, identify language), so one decoder handles tasks that traditionally required a whole pipeline of specialized models. The turbo variant is an optimized version of the large-v3 model that trades a small accuracy hit for roughly 8× the speed of large.
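To make the token idea concrete, here is a small sketch of how such a decoder prompt could be assembled. The token spellings match those used by Whisper's tokenizer, but `build_task_prompt` is a hypothetical helper written for illustration: it works on plain strings and is not the library's own API.

```python
def build_task_prompt(language=None, task="transcribe", timestamps=False):
    """Return the special tokens that tell the decoder which job to do.

    Illustrative only: real prompts are token IDs produced by the tokenizer.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    prompt = ["<|startoftranscript|>"]

    if language is None:
        # Language identification: stop here and let the model predict
        # the language token as its next output.
        return prompt

    prompt.append(f"<|{language}|>")  # e.g. <|en|>, <|fr|>
    prompt.append(f"<|{task}|>")      # <|transcribe|> or <|translate|>
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


print(build_task_prompt("fr", "translate"))
```

The same audio with a different prefix yields a different job, which is why no separate translation or language-ID model is needed.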
Key highlights
- Six sizes, four with English-only variants; VRAM requirements span ~1 GB to ~10 GB
- The `turbo` model runs ~8× faster than `large` with “minimal degradation in accuracy”
- Performance varies significantly by language; WER/CER breakdowns are published in the paper appendices
- Code and model weights are MIT-licensed
- Requires `ffmpeg` on your system; may need Rust installed for tiktoken compilation
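As a rough guide to the size and VRAM trade-off, the sketch below picks the most demanding model that fits a given memory budget. The per-model VRAM figures are the approximate values listed in the project's README and may change between releases; `pick_model` is a hypothetical helper, not something Whisper provides.

```python
# Approximate VRAM needs in GB, ordered from least to most demanding.
# Figures as listed in the Whisper README; treat them as rough guidance.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("turbo", 6),
    ("large", 10),
]

# The four sizes that also ship as English-only ".en" variants.
HAS_ENGLISH_ONLY = {"tiny", "base", "small", "medium"}


def pick_model(vram_gb, english_only=False):
    """Return the name of the most demanding model that fits, or None."""
    fitting = [
        name
        for name, needed in MODEL_VRAM_GB
        if needed <= vram_gb and (not english_only or name in HAS_ENGLISH_ONLY)
    ]
    if not fitting:
        return None
    name = fitting[-1]
    return f"{name}.en" if english_only else name


print(pick_model(8))                     # turbo
print(pick_model(4, english_only=True))  # small.en
```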
Caveats
- The `turbo` model is not trained for translation; it ignores `--task translate` and returns text in the original language
- Real-world speed “may vary significantly” from the A100 benchmarks depending on language, speaking speed, and hardware
- Installation can involve chasing down `setuptools_rust` and PATH tweaks if pre-built wheels are missing
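The translation caveat is easy to trip over because nothing fails loudly. One way to guard against it is a small check before loading a model, sketched below. `model_for_task` and its `fallback` default are assumptions made for illustration; any multilingual non-turbo size would serve as the fallback.

```python
def model_for_task(model_name, task, fallback="medium"):
    """Swap out `turbo` when translation is requested.

    Hypothetical helper: Whisper itself will not raise an error here, it just
    returns text in the source language.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    if task == "translate" and model_name == "turbo":
        # turbo is not trained for translation, so use a model that is.
        return fallback
    return model_name


print(model_for_task("turbo", "translate"))   # medium
print(model_for_task("turbo", "transcribe"))  # turbo
```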
Verdict
Worth a look if you need multilingual speech-to-text without orchestrating a stack of specialized tools. Skip it if you need real-time streaming or guaranteed low-latency inference: the 30-second sliding window and autoregressive decoding are not built for speed-of-conversation use cases.