QwenLM/Qwen2.5-Omni
A 7B-parameter end-to-end multimodal foundation model by Alibaba's Qwen team that processes text, images, audio, and video while generating both text and speech.

Velocity (7d): +9.1 ★/day · Trend: steady
Qwen2.5-Omni is a flagship multimodal foundation model from Alibaba Cloud’s Qwen team. It processes text, image, audio, and video inputs end to end, and it generates both streaming text responses and natural-sounding speech. The model ranked first among 7B-parameter multimodal models and is available on Hugging Face and ModelScope, with supporting cookbooks and demos.