QwenLM/Qwen3-Omni
A natively end-to-end multilingual omni-modal foundation model that processes text, audio, images, and video while generating real-time text and speech responses.

Velocity (7d): +15 ★/day · Trend: steady
Qwen3-Omni is a foundation model that handles multiple modalities in a single unified architecture. It accepts text, images, audio, and video as input and generates both text and natural speech as output in real time. The model takes an end-to-end approach to multimodal understanding and generation. Alibaba Cloud’s Qwen team released it together with model weights, demos, and cookbooks.