microsoft/XPretrain

Multi-modality pre-training research repository from Microsoft Research covering video-language and image-language foundation models.

Velocity (7d): +0.3 ★/day · Trend: steady

This repository collects research from Microsoft Research’s Multimedia Search and Mining group on multi-modality learning, with a focus on pre-training methods. It includes video-language models (HD-VILA, LF-VILA, CLIP-ViP) and image-language models (Pixel-BERT, SOHO, VisualParsing), along with HD-VILA-100M, a large-scale video-language dataset. The models use Transformer architectures and are trained with self-supervised pre-training objectives on visual and textual data.
