OpenGVLab/InternVL
A family of open-source multimodal large language models supporting vision-language tasks such as image classification, semantic segmentation, and video understanding.

InternVL is a research project that releases open-source multimodal models positioned as an alternative to commercial systems such as GPT-4o. Its architecture pairs a Vision Transformer (ViT) encoder with a large language model, and it supports visual question answering, image-text retrieval, semantic segmentation, and video classification. The project provides model weights, training code, inference tools, and a chat demo.
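
To make the idea of combining a ViT encoder with a language model concrete, here is a minimal, dependency-free sketch of how image patches might be turned into tokens and placed in the same sequence as text embeddings. It is an illustration under stated assumptions, not InternVL's implementation: the names `patchify`, `linear`, and `build_llm_input` are hypothetical, a single linear projection stands in for the full ViT encoder and any connector module, and prepending the visual tokens to the text tokens is one possible ordering.

```python
from typing import List

Vector = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Vector]:
    """Split an H x W grayscale image into flattened patch x patch tiles (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Compute y = W x, where `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def build_llm_input(image: List[List[float]],
                    text_embeddings: List[Vector],
                    patch: int,
                    projector: List[Vector]) -> List[Vector]:
    """Project each patch to the language model's embedding width (assumed step)
    and prepend the resulting visual tokens to the text token embeddings."""
    visual_tokens = [linear(p, projector) for p in patchify(image, patch)]
    return visual_tokens + text_embeddings
```

In this sketch, a 4 x 4 image with a patch size of 2 yields four visual tokens. With two text embeddings of the same width, the language model would receive a six-token sequence.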