
ChenRocks/UNITER

A multimodal foundation model that learns joint image-text representations by pre-training on large-scale image-caption pairs.

800 stars · Python · Language Models · Computer Vision
Velocity · 7d: +0.3 ★ / day · Trend: steady

UNITER is a transformer-based vision-language model that learns unified representations by jointly encoding images and text. It is pre-trained with four objectives: masked language modeling, masked region modeling, image-text matching, and word-region alignment. Released checkpoints include UNITER-base and UNITER-large, and the repository supports fine-tuning on NLVR2, VQA, VCR, SNLI-VE, image-caption retrieval, and referring expression comprehension.
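To make two of the pre-training tasks concrete, here is a minimal sketch of how training examples for masked language modeling and image-text matching might be built. It is not taken from the UNITER codebase: the function names, the `[MASK]` placeholder, the 15% masking rate, and the 50% negative-sampling rate are all illustrative assumptions.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not necessarily the repo's vocabulary entry


def mask_text_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked language modeling sketch (hypothetical helper).

    Each caption token is replaced by MASK with probability mask_prob.
    labels[i] holds the original token where masking happened, else None.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model would be trained to recover this token
        else:
            masked.append(tok)
            labels.append(None)  # no prediction target at this position
    return masked, labels


def make_itm_pair(caption, image_id, other_image_ids, neg_prob=0.5, rng=None):
    """Image-text matching sketch (hypothetical helper).

    With probability neg_prob the caption is paired with a randomly chosen
    other image and labeled 0 (mismatch); otherwise the true pair is kept
    and labeled 1 (match).
    """
    if rng is None:
        rng = random.Random()
    if other_image_ids and rng.random() < neg_prob:
        return caption, rng.choice(other_image_ids), 0
    return caption, image_id, 1


if __name__ == "__main__":
    rng = random.Random(0)
    caption = "a dog runs on the beach".split()
    print(mask_text_tokens(caption, rng=rng))
    print(make_itm_pair("a dog runs on the beach", "img_1", ["img_2", "img_3"], rng=rng))
```

Masked region modeling and word-region alignment operate on image region features as well as text, so they are left out of this text-only sketch.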
