A 1.7B-parameter Whisper-killer that speaks 52 languages and dialects
Alibaba's Qwen team open-sources a compact ASR family that handles speech, music, songs, and timestamps without chaining half a dozen tools.

What it does
Qwen3-ASR is a family of speech recognition models, at 1.7B and 0.6B parameters, built on the Qwen3-Omni foundation. They transcribe 30 languages plus 22 Chinese dialects, identify the language automatically, and even handle singing voice and songs with background music. A separate 0.6B non-autoregressive ForcedAligner model adds timestamp prediction for 11 languages. The qwen-asr Python package wraps everything, with transformers and vLLM backends, plus streaming, batch inference, and a Gradio demo.
The interesting bit
The 0.6B model reportedly reaches 2000× throughput at a concurrency of 128, while the 1.7B model claims state-of-the-art accuracy among open-source ASR models and parity with top proprietary APIs. That’s a rare accuracy-efficiency trade-off actually delivered in one repo, not a spreadsheet fantasy.
Key highlights
- 52 languages and dialects in one model, including regional Chinese variants like Wu and Minnan
- Unified streaming and offline inference from the same weights
- ForcedAligner timestamps arbitrary units in clips of up to 5 minutes, reportedly beating end-to-end alignment approaches
- vLLM backend for production serving; Docker image and DashScope API also available
- FlashAttention 2 supported for long-audio memory savings
Caveats
- FlashAttention 2 requires float16/bfloat16 and compatible hardware; the README warns about RAM limits during install
- vLLM backend needs careful if __name__ == '__main__' wrapping to avoid multiprocessing spawn errors
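The if __name__ == '__main__' caveat is a general Python multiprocessing pattern rather than anything specific to qwen-asr. Under the spawn start method, each worker process re-imports the launching script, so any unguarded top-level code runs again in every worker. Here is a minimal sketch using only the standard library; square and main are hypothetical stand-ins, and in a real script the model construction and inference calls would presumably sit inside main() in the same way.

```python
import multiprocessing as mp


def square(x: int) -> int:
    """Stand-in for real work done in a worker process (hypothetical)."""
    return x * x


def main() -> list[int]:
    # "spawn" starts each worker as a fresh interpreter that re-imports
    # this module, so everything that launches processes lives in here.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


# Without this guard, every spawned worker would call main() again while
# importing the module, and Python raises a RuntimeError about starting
# a new process before bootstrapping has finished.
if __name__ == "__main__":
    print(main())  # prints [1, 4, 9]
```

The design point is that importing the file must have no side effects: definitions go at module level, and anything that starts processes goes behind the guard.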
Verdict
Worth a look if you need multilingual ASR without the usual Rube Goldberg pipeline of language-ID → diarization → transcription → alignment. Probably overkill if you only ever transcribe clean English phone calls.