baaivision/Emu3.5
A multimodal large language model that processes and generates interleaved visual-text sequences using unified next-token prediction without modality adapters.

Emu3.5 is a natively multimodal foundation model that predicts the next state jointly across vision and language. It is pre-trained end-to-end with a single next-token prediction objective on over 10 trillion interleaved tokens drawn from video frames and their transcripts. Both modalities are handled in one token sequence, with no adapters or task-specific heads, and reinforcement learning post-training is then applied to strengthen reasoning.
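
To make the idea of unified next-token prediction over interleaved data concrete, here is a minimal sketch in plain Python. It is not Emu3.5's actual tokenizer or training code: the vocabulary sizes, the offset scheme that places vision tokens after text tokens, and the helper names `to_unified`, `next_token_pairs`, and `cross_entropy` are all assumptions made for illustration. It shows only that once both modalities share one vocabulary, a single shifted-target loss covers text and vision positions alike.

```python
import math

# Hypothetical vocabulary sizes, chosen small for illustration only.
TEXT_VOCAB_SIZE = 8
VISION_VOCAB_SIZE = 4
VOCAB_SIZE = TEXT_VOCAB_SIZE + VISION_VOCAB_SIZE


def to_unified(segments):
    """Flatten (modality, ids) segments into one sequence over a shared vocabulary.

    Assumption: vision token ids are shifted past the text ids so that both
    modalities live in a single id space.
    """
    sequence = []
    for modality, ids in segments:
        offset = TEXT_VOCAB_SIZE if modality == "vision" else 0
        sequence.extend(token_id + offset for token_id in ids)
    return sequence


def next_token_pairs(sequence):
    """Pair each token with the token that follows it (shift-by-one targets)."""
    return list(zip(sequence[:-1], sequence[1:]))


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


# An interleaved example: text, then an image's tokens, then more text.
segments = [("text", [1, 5]), ("vision", [0, 3, 2]), ("text", [7])]
sequence = to_unified(segments)
pairs = next_token_pairs(sequence)

# Stand-in for a model: uniform logits over the shared vocabulary.
uniform_logits = [0.0] * VOCAB_SIZE

# The same loss is applied at every position, regardless of modality.
loss = sum(cross_entropy(uniform_logits, target) for _, target in pairs) / len(pairs)
```

With a real model, `uniform_logits` would be replaced by per-position predictions conditioned on the preceding tokens. The point of the sketch is that no modality-specific head or loss term appears anywhere in the objective.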