salesforce/LAVIS
A one-stop deep learning library for building, training, and evaluating vision-language and multimodal AI models.

LAVIS is a library for language-vision intelligence developed by Salesforce AI Research. It provides a unified interface to multimodal models and tasks, including vision-language transformers, image captioning, visual question answering, and cross-modal frameworks built on frozen LLMs. Reusable components for datasets, models, training pipelines, and evaluation benchmarks let the library integrate multiple modalities (image, video, audio, 3D).
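
To make the idea of a unified interface over reusable components more concrete, here is a minimal, self-contained sketch of a name-based model registry. It is an illustration of the general pattern only, not the LAVIS API: `register_model`, `load_model`, `ToyCaptioner`, `ToyVQA`, and the sample keys `image` and `text_input` are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a model name to a constructor.
# This is an illustrative pattern, not code taken from LAVIS.
_MODEL_REGISTRY: Dict[str, Callable[[], "ToyModel"]] = {}


def register_model(name: str):
    """Class decorator that records a model class under a string name."""
    def decorator(cls):
        _MODEL_REGISTRY[name] = cls
        return cls
    return decorator


class ToyModel:
    """Common interface: every model takes a sample dict and returns text."""

    def predict(self, sample: dict) -> str:
        raise NotImplementedError


@register_model("toy_caption")
class ToyCaptioner(ToyModel):
    def predict(self, sample: dict) -> str:
        # A real captioner would run a vision encoder and a text decoder here.
        return f"a picture from {sample['image']}"


@register_model("toy_vqa")
class ToyVQA(ToyModel):
    def predict(self, sample: dict) -> str:
        # A real VQA model would condition on both the image and the question.
        return f"answer to: {sample['text_input']}"


def load_model(name: str) -> ToyModel:
    """Look up a model by name and instantiate it."""
    if name not in _MODEL_REGISTRY:
        raise KeyError(f"unknown model: {name}")
    return _MODEL_REGISTRY[name]()


if __name__ == "__main__":
    sample = {"image": "cat.jpg", "text_input": "what animal is this?"}
    for name in ("toy_caption", "toy_vqa"):
        print(name, "->", load_model(name).predict(sample))
```

Because every model is reached through the same `load_model` call and exposes the same `predict` method, calling code can switch tasks by changing a single string, which is the kind of reuse a unified interface is meant to enable.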