VITA-MLLM/VITA

A multimodal LLM for real-time vision and speech interaction, aiming for GPT-4o-level performance.

Velocity (7d): +3.8 ★/day · Trend: steady

VITA-1.5 is an open-source omni-modal large language model designed for real-time vision and speech interaction. It supports bidirectional understanding of video, audio, and text modalities in both English and Chinese. The project provides model weights, inference code, and a technical report, targeting GPT-4o-level conversational and perceptual capabilities.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.