QwenLM/Qwen2.5-Omni

A 7B-parameter end-to-end multimodal foundation model by Alibaba's Qwen team that processes text, images, audio, and video while generating both text and speech.

★ 4k stars · Jupyter Notebook · Language Models · Image · Video · Audio

Velocity (7d): +9.1 ★ / day · Trend: steady

Qwen2.5-Omni is a flagship multimodal foundation model from Alibaba Cloud’s Qwen team. It processes text, images, audio, and video end to end, and generates both streaming text responses and natural-sounding speech. The model ranked first among 7B-parameter multimodal models and is available on Hugging Face and ModelScope, with supporting cookbooks and demos.