mbzuai-oryx/Video-ChatGPT
Video-ChatGPT is a vision-language model that enables conversational interaction about videos by combining a pretrained video encoder with large language models.

The model holds conversations about video content by passing spatiotemporal video representations from a visual encoder to an LLM, which supplies the reasoning. The work was published at ACL 2024 and introduces a quantitative benchmark (VCGBench-Diverse) designed to evaluate video-based conversational models across diverse dimensions. The model also supports zero-shot question answering on video datasets.
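
To make "spatiotemporal video representations" concrete, here is a minimal, dependency-free sketch of one way per-frame patch features could be pooled into a short token sequence and projected to an LLM's embedding width. The pooling scheme (average over patches for temporal tokens, average over frames for spatial tokens), the function names `mean_vectors`, `spatiotemporal_pool`, and `project`, and the toy shapes are all assumptions made for illustration; this description does not specify them, and the code is not the repository's implementation.

```python
from typing import List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def spatiotemporal_pool(frames: List[List[Vector]]) -> List[Vector]:
    """Pool features where frames[t][n] is the D-dim feature of patch n in frame t.

    Returns T temporal tokens followed by N spatial tokens (assumed scheme).
    """
    # Temporal tokens: average over the patches of each frame -> T vectors.
    temporal = [mean_vectors(patches) for patches in frames]
    # Spatial tokens: average each patch position over all frames -> N vectors.
    spatial = [mean_vectors(list(per_patch)) for per_patch in zip(*frames)]
    return temporal + spatial


def project(tokens: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Linearly map each D-dim token to width E; weight has shape D x E."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(token, col)) for col in columns]
            for token in tokens]


# Toy input: T=2 frames, N=3 patches per frame, D=2 feature dimensions.
frames = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
# Hypothetical projection weights mapping D=2 to E=3.
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

video_tokens = project(spatiotemporal_pool(frames), weight)
print(len(video_tokens), len(video_tokens[0]))
```

In a full system, tokens like `video_tokens` would be placed alongside the embedded text prompt before being fed to the language model; that step is omitted here because it depends on the specific LLM.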