NVlabs/prismer
Prismer is a vision-language model that draws on multiple pre-trained experts to handle tasks such as image captioning and visual question answering.

Prismer's architecture combines multiple pre-trained expert models so that a single model can handle diverse vision-language tasks: image captioning, visual question answering, and other multimodal tasks. It is built on PyTorch and uses Hugging Face Accelerate for distributed multi-node, multi-GPU training. A demo is available on Hugging Face Spaces.
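
To make the multi-expert idea concrete, here is a minimal pure-Python sketch of the general pattern: several frozen experts process the same image, and their outputs are gathered into one feature set for a downstream task. This is not Prismer's actual code. The `Expert` class, the `run_experts` and `fuse` helpers, and the toy "edges" and "brightness" experts are all hypothetical names invented for illustration, and plain concatenation stands in for whatever learned fusion a real model would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: a flat list of pixel intensities.
Image = List[float]


@dataclass(frozen=True)
class Expert:
    """A frozen, pre-trained model wrapped as a plain function (illustrative)."""
    name: str
    predict: Callable[[Image], List[float]]


def run_experts(image: Image, experts: List[Expert]) -> Dict[str, List[float]]:
    """Run every expert on the same image and collect outputs by expert name."""
    return {expert.name: expert.predict(image) for expert in experts}


def fuse(image: Image, expert_outputs: Dict[str, List[float]]) -> List[float]:
    """Concatenate the raw image with each expert output, in name order.

    Assumption: a real model would learn this fusion; concatenation is
    only a placeholder to show where the expert signals enter.
    """
    fused = list(image)
    for name in sorted(expert_outputs):
        fused.extend(expert_outputs[name])
    return fused


# Toy experts, invented for this sketch only.
toy_experts = [
    Expert("edges", lambda img: [abs(b - a) for a, b in zip(img, img[1:])]),
    Expert("brightness", lambda img: [sum(img) / len(img)]),
]

image = [0.0, 0.5, 1.0]
outputs = run_experts(image, toy_experts)
features = fuse(image, outputs)
```

Because each expert sits behind the same small interface, adding or removing one only changes the list passed to `run_experts`; the fusion step and any task head built on `features` stay untouched.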