Alpha-VL/ConvMAE
MCMAE is a masked-autoencoder vision backbone that combines convolutions with a transformer architecture for image classification, detection, and segmentation.

MCMAE (Masked Convolution Meets Masked Autoencoders, also known as ConvMAE) is a computer vision model that combines masked-autoencoder pretraining with multi-scale convolutions. It provides pretrained backbones that can be fine-tuned for downstream tasks, including ImageNet classification, object detection (with Mask R-CNN), semantic segmentation, and video classification. By integrating hierarchical, multi-scale representations, the approach speeds up training and improves transfer-learning performance compared with vanilla MAE.
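
To make the idea of masking on a multi-scale backbone more concrete, here is a minimal sketch of how a single random patch mask could be kept consistent across feature maps of different resolutions. It is not taken from the repository: the function names `random_patch_mask` and `upsample_mask`, the 14x14 grid, the 0.75 mask ratio, and the 2x and 4x upsampling factors are all illustrative assumptions, and the code uses only the Python standard library rather than the project's actual framework.

```python
import random


def random_patch_mask(grid, mask_ratio, seed=0):
    """Return a grid x grid list of lists; True marks a masked patch.

    Hypothetical helper: draws a fixed number of patches uniformly at random.
    """
    rng = random.Random(seed)
    n_patches = grid * grid
    n_masked = int(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * grid + c) in masked for c in range(grid)] for r in range(grid)]


def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling: each cell becomes a factor x factor block."""
    return [
        [cell for cell in row for _ in range(factor)]
        for row in mask
        for _ in range(factor)
    ]


# Assumed sizes: a 14x14 coarse grid with finer 28x28 and 56x56 grids above it.
coarse = random_patch_mask(grid=14, mask_ratio=0.75)
mid = upsample_mask(coarse, 2)
fine = upsample_mask(coarse, 4)

# The masked fraction is identical at every scale.
for name, m in (("coarse", coarse), ("mid", mid), ("fine", fine)):
    total = len(m) * len(m[0])
    print(name, len(m), sum(map(sum, m)) / total)
```

Drawing the mask once on the coarsest grid and expanding it means a region hidden at one scale is hidden at every scale, so no resolution sees pixels that another was meant to reconstruct. Whether the repository generates its masks exactly this way should be checked against its source code.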