baaivision/Emu3.5

A multimodal large language model that processes and generates interleaved visual-text sequences using unified next-token prediction without modality adapters.

★1.5k stars Python Language Models Image · Video · Audio

View on GitHub ↗

Velocity · 7d

+6.9

★ / day

Trend

→steady

star history

Emu3.5 is a native multimodal foundation model that jointly predicts next states across vision and language. It is pre-trained on over 10 trillion interleaved tokens from video frames and transcripts using an end-to-end unified objective. The model handles both visual and textual modalities natively without adapters or task-specific heads, and employs reinforcement learning post-training to enhance reasoning capabilities.