PKU-YuanGroup/Chat-UniVi
A unified visual representation model that enables large language models to understand both image and video content.

Chat-UniVi is a vision-language model that gives large language models a unified visual understanding of both images and videos. The project proposes a single visual representation for both modalities, built on dynamic token allocation across different resolutions. This lets the model process multiple video frames while retaining fine-grained image understanding. The work was published as a CVPR 2024 Highlight paper.
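
The idea of sharing one visual token budget between images and videos can be illustrated with a small sketch. This is not the project's actual code or algorithm: the even per-frame split, the contiguous-average merge, and the names `tokens_per_frame`, `merge_tokens`, and `encode_visual` are all assumptions chosen for illustration. It only shows how a single image can keep many tokens while each frame of a video gets fewer.

```python
from typing import List

Token = List[float]  # one visual token as a plain feature vector


def tokens_per_frame(budget: int, num_frames: int, min_tokens: int = 1) -> int:
    """Split a fixed token budget evenly across frames (hypothetical policy)."""
    if num_frames < 1:
        raise ValueError("num_frames must be >= 1")
    # Every frame keeps at least `min_tokens`, so the total can exceed
    # `budget` when there are more frames than budgeted tokens.
    return max(min_tokens, budget // num_frames)


def merge_tokens(tokens: List[Token], target: int) -> List[Token]:
    """Reduce `tokens` to at most `target` tokens by averaging contiguous groups.

    Averaging neighbours is an assumed stand-in for whatever merging
    strategy a real model would use.
    """
    n = len(tokens)
    if target >= n:
        return [list(t) for t in tokens]  # nothing to merge
    merged: List[Token] = []
    for g in range(target):
        start = g * n // target
        end = (g + 1) * n // target
        group = tokens[start:end]
        dim = len(group[0])
        merged.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return merged


def encode_visual(frames: List[List[Token]], budget: int) -> List[Token]:
    """Treat an image as a one-frame video and return one flat token sequence."""
    per_frame = tokens_per_frame(budget, len(frames))
    out: List[Token] = []
    for frame in frames:
        out.extend(merge_tokens(frame, per_frame))
    return out
```

With a budget of 4, a single image holding 8 tokens keeps 4 of them after merging, while a 4-frame video keeps 1 token per frame. Both inputs therefore reach the language model as sequences of the same length, which is the sense in which the representation is "unified" in this sketch.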