microsoft/SimMIM
Microsoft's official implementation of SimMIM, a self-supervised framework for pre-training vision transformers via masked image modeling.

SimMIM takes a deliberately simple approach to masked image modeling: it randomly masks patches of the input image, using a moderately large masked patch size (e.g., 32), and predicts the raw RGB pixel values of the masked regions by direct regression. The repository supports pre-training and fine-tuning on ImageNet-1K with Swin Transformer and ViT models, and achieves strong representation learning performance without a complex prediction head design.
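The two ingredients described above, random patch masking and pixel regression, can be illustrated with a minimal, dependency-free sketch. This is not the repository's code: the function names `random_patch_mask` and `masked_l1_loss`, the 192-pixel image size, the 0.6 mask ratio, the choice to round the masked-patch count up, and the choice of an L1 loss computed only over masked pixels are all assumptions made for illustration.

```python
import math
import random


def random_patch_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6, seed=None):
    """Return a grid (list of rows) of 0/1 flags, one per mask patch; 1 = masked.

    All defaults except mask_patch_size=32 are illustrative assumptions.
    """
    if img_size % mask_patch_size != 0:
        raise ValueError("img_size must be divisible by mask_patch_size")
    grid = img_size // mask_patch_size
    num_patches = grid * grid
    # Assumption: round the number of masked patches up.
    num_masked = math.ceil(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if r * grid + c in masked else 0 for c in range(grid)]
            for r in range(grid)]


def masked_l1_loss(pred, target, patch_mask, mask_patch_size):
    """Mean absolute error over the pixels that fall inside masked patches.

    pred and target are H x W x C nested lists of numbers.
    Assumption: an L1 loss restricted to masked pixels.
    """
    total = 0.0
    count = 0
    for y, (pred_row, target_row) in enumerate(zip(pred, target)):
        for x, (pred_px, target_px) in enumerate(zip(pred_row, target_row)):
            # Map the pixel coordinate to the mask patch that contains it.
            if patch_mask[y // mask_patch_size][x // mask_patch_size]:
                for p, t in zip(pred_px, target_px):
                    total += abs(p - t)
                    count += 1
    return total / count if count else 0.0
```

In a real pre-training loop, `pred` would come from the encoder followed by a lightweight prediction head and `target` would be the (typically normalized) input image; here both are plain nested lists so that the masking and loss logic can be read in isolation.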