apple/ml-4m
A training framework for any-to-any multimodal foundation models supporting dozens of vision modalities and tasks.

Velocity (7d): +2.3 ★/day · Trend: steady
4M is a research framework for training foundation models that handle arbitrary input-output modality combinations using masked token modeling and unified tokenization. The released 4M-7 and 4M-21 models perform diverse vision tasks including generation, detection, segmentation, and transfer to unseen tasks and modalities. Code, pretrained weights, and training infrastructure are open-sourced for reproducibility.
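
To make the masked token modeling idea concrete, here is a minimal sketch in plain Python. It is not taken from the ml-4m codebase: the function name `sample_input_target_split`, the modality names, and the token IDs are all hypothetical, and the framework's actual sampling scheme may differ. It assumes each modality has already been tokenized into discrete IDs, pools the tokens from all modalities into one sequence, and randomly picks a visible input subset and a disjoint target subset for the model to predict.

```python
import random


def sample_input_target_split(tokens_by_modality, num_inputs, num_targets, seed=None):
    """Split pooled multimodal tokens into visible inputs and masked targets.

    Hypothetical helper for illustration only; not part of the ml-4m API.
    """
    rng = random.Random(seed)
    # Pool every token as (modality, position, token_id) so that all
    # modalities live in one shared sequence.
    pooled = [
        (modality, pos, tok)
        for modality, toks in tokens_by_modality.items()
        for pos, tok in enumerate(toks)
    ]
    if num_inputs + num_targets > len(pooled):
        raise ValueError("token budget exceeds the number of available tokens")
    # Sampling without replacement keeps inputs and targets disjoint.
    chosen = rng.sample(pooled, num_inputs + num_targets)
    return chosen[:num_inputs], chosen[num_inputs:]


# Made-up token IDs standing in for the output of per-modality tokenizers.
example = {
    "rgb": [11, 12, 13, 14],
    "depth": [21, 22, 23, 24],
    "caption": [31, 32],
}
inputs, targets = sample_input_target_split(example, num_inputs=4, num_targets=3, seed=0)
print("visible inputs:", inputs)
print("masked targets:", targets)
```

Because the input and target subsets can each fall in any modality, a single training objective of this shape covers many input-output combinations, which is the intuition behind the any-to-any description above.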