salesforce/BLIP
Salesforce's PyTorch implementation of BLIP (Bootstrapping Language-Image Pre-training), a pre-training framework for unified vision-language understanding and generation.

BLIP is a vision-language foundation model pre-trained on image-text pairs. Its bootstrapping step cleans noisy web data: a captioner generates synthetic captions for web images, and a filter removes noisy captions, both original and synthetic. The repository provides pre-trained checkpoints, fine-tuning code for downstream tasks (image captioning, visual question answering, and image-text retrieval), and interactive inference demos. The code is built on PyTorch, and BLIP has since been integrated into the LAVIS library, the successor to this repository.
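
The bootstrapping idea can be illustrated with a small, self-contained sketch. As the BLIP paper describes it, a captioner proposes a synthetic caption for each web image and a filter keeps only captions that match the image. Everything below is an assumption made for illustration: `Pair`, `bootstrap_captions`, `toy_captioner`, and `toy_match_score` are hypothetical names, images are replaced by plain string ids, and none of this is the repository's actual API or training code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    image_id: str            # stand-in for real image data (assumption)
    text: str                # caption attached to the image
    synthetic: bool = False  # True if the caption came from the captioner


def bootstrap_captions(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep web and synthetic captions whose image-text match score passes the threshold."""
    cleaned: List[Pair] = []
    for pair in web_pairs:
        # Candidate captions: the original web text plus one synthetic caption.
        synthetic = Pair(pair.image_id, captioner(pair.image_id), synthetic=True)
        for candidate in (pair, synthetic):
            if match_score(candidate.image_id, candidate.text) >= threshold:
                cleaned.append(candidate)
    return cleaned


# Toy stand-ins for the learned captioner and filter (hypothetical, not real models).
def toy_captioner(image_id: str) -> str:
    return f"a photo of a {image_id}"


def toy_match_score(image_id: str, text: str) -> float:
    # Pretend the caption matches if it mentions the image's label.
    return 1.0 if image_id in text.split() else 0.0


web = [
    Pair("dog", "buy cheap tickets now"),  # noisy web text, unrelated to the image
    Pair("cat", "my cat on the sofa"),     # web text that matches the image
]
cleaned = bootstrap_captions(web, toy_captioner, toy_match_score)
for p in cleaned:
    print(p)
```

In this toy run the unrelated web caption for the dog image is dropped, while its synthetic caption and both captions for the cat image are kept. In the real framework the captioner and filter are learned models rather than string rules, so this only shows the data flow.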