QwenLM/Qwen3-ASR

A 1.7B-parameter Whisper-killer that speaks 52 languages and dialects

Alibaba's Qwen team open-sources a compact ASR family that handles speech, music, songs, and timestamps without chaining half a dozen tools.

Star velocity (7d): +22 ★ / day · Trend: steady

What it does

Qwen3-ASR is a family of speech recognition models (1.7B and 0.6B parameters) built on the Qwen3-Omni foundation model. They transcribe 30 languages plus 22 Chinese dialects, identify the language automatically, and handle singing voice and songs with background music. A separate 0.6B non-autoregressive ForcedAligner model adds timestamp prediction for 11 languages. The qwen-asr Python package wraps everything, with transformers and vLLM backends, streaming and batch inference, and a Gradio demo.

The interesting bit

The README reports 2000× throughput for the 0.6B model at concurrency 128, while the 1.7B model claims state-of-the-art accuracy among open-source ASR models and parity with top proprietary APIs. That’s a rare accuracy-efficiency trade-off actually delivered in one repo, not a spreadsheet fantasy.
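To put the 2000× figure in perspective, here is a back-of-the-envelope sketch. It assumes "2000× throughput" means the server gets through 2000 seconds of audio per wall-clock second, summed across all 128 concurrent requests. That reading is an interpretation, not something the summary above spells out, and the function names are made up for illustration.

```python
def wall_clock_seconds(audio_hours: float, throughput_x: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_hours` of audio,
    assuming `throughput_x` audio-seconds are processed per second."""
    return audio_hours * 3600 / throughput_x


def per_stream_speedup(throughput_x: float, concurrency: int) -> float:
    """Average real-time multiple seen by each concurrent request,
    assuming the aggregate throughput is shared evenly."""
    return throughput_x / concurrency


# 1000 hours of audio at an assumed aggregate 2000x real time
print(wall_clock_seconds(1000, 2000) / 60, "minutes")  # 30.0 minutes
# Average speed of each of the 128 concurrent requests
print(per_stream_speedup(2000, 128), "x real time")    # 15.625 x real time
```

Under that assumption, the headline number describes batch capacity rather than single-request latency: each individual request would run at roughly 16× real time on average.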

Key highlights

  • 52 languages and dialects in one model, including regional Chinese variants like Wu and Minnan
  • Unified streaming and offline inference from the same weights
  • ForcedAligner timestamps arbitrary text units in clips up to 5 minutes long, with reported accuracy better than end-to-end alignment approaches
  • vLLM backend for production serving; Docker image and DashScope API also available
  • FlashAttention 2 supported for long-audio memory savings

Caveats

  • FlashAttention 2 requires float16/bfloat16 and compatible hardware; the README warns about RAM limits during install
  • vLLM backend requires the script's entry point to be wrapped in if __name__ == '__main__': to avoid multiprocessing spawn errors
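The guard in the last caveat is a general Python multiprocessing pattern, not something specific to this repo. The sketch below uses only the standard library: run_batch is a hypothetical name, and os.path.basename stands in for real per-file work. With the "spawn" start method, each worker is a fresh interpreter that re-imports the main script, so any process-launching code left at module top level would run again inside every worker.

```python
import multiprocessing as mp
import os


def run_batch(paths):
    """Fan per-file work out to spawned worker processes."""
    # "spawn" starts a fresh interpreter per worker and re-imports the
    # main module, so nothing at top level may start processes itself.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        # os.path.basename is a stand-in for real work such as transcription
        return pool.map(os.path.basename, paths)


if __name__ == "__main__":
    # Only the parent process runs this block; spawned workers import the
    # module under a different name and skip it.
    print(run_batch(["clips/a.wav", "clips/b.wav"]))
```

Calling run_batch at module top level instead of under the guard typically fails with a RuntimeError about starting a new process before the current one has finished bootstrapping.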

Verdict

Worth a look if you need multilingual ASR without the usual Rube Goldberg pipeline of language-ID → diarization → transcription → alignment. Probably overkill if you only ever transcribe clean English phone calls.
