microsoft/XPretrain
Multi-modality pre-training research repository from Microsoft Research covering video-language and image-language foundation models.

This repository collects recent research from Microsoft Research’s Multimedia Search and Mining group on multi-modality learning, with a focus on pre-training methods. It includes models for video-language understanding (HD-VILA, LF-VILA, CLIP-ViP) and image-language understanding (Pixel-BERT, SOHO, VisualParsing), along with HD-VILA-100M, a large-scale video-language dataset. The models use Transformer architectures and are trained with self-supervised pre-training objectives on visual and textual data.