invictus717/MetaTransformer
A single transformer architecture that processes 12 modalities including text, images, audio, video, point clouds, and graphs without modality-specific modifications.

Meta-Transformer is a unified multimodal learning framework that uses a single frozen transformer encoder to handle diverse data modalities. Each modality's input is mapped into a shared token space, and the resulting token sequences are processed by a standard transformer backbone with no modality-specific modifications. It was published at ICCV 2023 and has been widely cited, reflecting broad research interest in unified multimodal architectures.
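
The idea of a shared token space feeding one frozen encoder can be illustrated with a small, dependency-free sketch. This is not the repository's code: `SharedEncoder`, `tokenize_text`, `tokenize_image`, the token width `DIM`, and the random weights are all hypothetical names and values chosen for illustration, and a single self-attention layer stands in for the full backbone.

```python
import math
import random

DIM = 8  # width of the shared token space (illustrative value)


def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def project(vec, mat):
    """Multiply a row vector (len = rows of mat) by mat, giving len = cols of mat."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


class SharedEncoder:
    """Hypothetical stand-in for the frozen backbone: one self-attention layer
    whose weights are created once and never updated afterwards."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.wq = make_matrix(dim, dim, rng)
        self.wk = make_matrix(dim, dim, rng)
        self.wv = make_matrix(dim, dim, rng)

    def __call__(self, tokens):
        q = [project(t, self.wq) for t in tokens]
        k = [project(t, self.wk) for t in tokens]
        v = [project(t, self.wv) for t in tokens]
        out = []
        for qi, ti in zip(q, tokens):
            # Scaled dot-product scores of this query against every key.
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(self.dim) for kj in k]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
            total = sum(exps)
            attended = [sum(e / total * vj[d] for e, vj in zip(exps, v)) for d in range(self.dim)]
            out.append([a + b for a, b in zip(ti, attended)])  # residual connection
        return out


def tokenize_text(text, seed=1):
    """Hypothetical text tokenizer: one DIM-wide embedding per UTF-8 byte."""
    table = make_matrix(256, DIM, random.Random(seed))
    return [table[b] for b in text.encode("utf-8")]


def tokenize_image(pixels, patch=2, seed=2):
    """Hypothetical image tokenizer: flatten patch x patch blocks, project to DIM."""
    proj = make_matrix(patch * patch, DIM, random.Random(seed))
    tokens = []
    for r in range(0, len(pixels), patch):
        for c in range(0, len(pixels[0]), patch):
            flat = [pixels[r + i][c + j] for i in range(patch) for j in range(patch)]
            tokens.append(project(flat, proj))
    return tokens


# The same encoder instance, with the same weights, serves both modalities.
encoder = SharedEncoder(DIM)
text_features = encoder(tokenize_text("hi"))
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]  # 4x4 grayscale
image_features = encoder(tokenize_image(image))
```

Only the tokenizers differ between the two inputs. Once both are expressed as sequences of `DIM`-wide tokens, the encoder treats them identically, which is the property the description above attributes to the framework.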