ChenRocks/UNITER
A multimodal foundation model that learns joint image-text representations by pre-training on large-scale image-caption pairs.

UNITER is a transformer-based vision-language model that learns unified representations by jointly encoding images and text. The model is pre-trained with four objectives: masked language modeling, masked region modeling, image-text matching, and word-region alignment. Released checkpoints include UNITER-base and UNITER-large, and the repository supports fine-tuning on NLVR2, VQA, VCR, SNLI-VE, image-caption retrieval, and referring expression comprehension.
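
To make the masked language modeling objective listed above more concrete, here is a minimal, standard-library-only sketch of BERT-style token masking. It is an illustration, not the repository's code: the function name `mask_tokens`, the toy vocabulary, the 15% masking rate, and the 80/10/10 replacement split are assumptions borrowed from the common BERT recipe and should be checked against the actual UNITER data pipeline before relying on them.

```python
import random

MASK = "[MASK]"  # placeholder mask token; the real tokenizer's symbol may differ


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Hypothetical helper: BERT-style masking over a list of word tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position becomes a prediction target.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # assumed: 80% replaced by mask
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # assumed: 10% random token
            else:
                masked.append(tok)                # assumed: 10% left unchanged
        else:
            # Not selected: token passes through and carries no loss.
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    caption = ["a", "dog", "runs", "across", "the", "park"]
    toy_vocab = ["cat", "tree", "ball"]  # toy vocabulary for illustration only
    print(mask_tokens(caption, toy_vocab, rng=random.Random(0)))
```

In a joint image-text setup, the model would predict each non-`None` label from the surrounding words together with the image region features, which stay unmasked for this objective. Masked region modeling applies the mirror-image idea to region features while the text stays intact.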