OpenGVLab/InternVideo
InternVideo provides video foundation models for multimodal video understanding, including video LLMs with temporal reasoning capabilities.

This repository contains the InternVideo series of video foundation models, trained with generative and discriminative self-supervised learning. InternVideo2 scales these models up and extends them to multimodal use through video-language alignment and instruction tuning. InternVideo2.5 adds long-context modeling for extended video understanding. The models are also distilled into smaller variants and integrated with 7B language models for video chat applications.