showlab/VLog
Video-language understanding system that converts videos into queryable text documents for LLM-based conversation, built around a generative retrieval narrator.

VLog introduces a GPT-2-based video narrator that uses a Narration Vocabulary for efficient video-language understanding. The VLog-Agent branch extends this by converting each video into a text document that captures both visual and audio information, then passing that document to an LLM through LangChain so users can chat about the video in natural language. It uses Whisper for audio processing and generative retrieval to produce narrations efficiently.
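
The video-to-document idea can be illustrated with a minimal Python sketch. This is not the project's code: `Segment`, `build_document`, and `retrieve` are hypothetical names, the captions and transcripts are hard-coded stand-ins for narrator and Whisper output, and a simple word-overlap ranking stands in for the LLM and LangChain layer.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time slice of a video (hypothetical structure, for illustration only)."""
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    caption: str      # visual narration; stand-in for narrator output
    transcript: str   # speech text; stand-in for speech-recognition output


def build_document(segments):
    """Serialize segments into a plain-text document, three lines per segment."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg.start:.0f}s-{seg.end:.0f}s]")
        lines.append(f"visual: {seg.caption}")
        lines.append(f"audio: {seg.transcript}")
    return "\n".join(lines)


def retrieve(segments, query, k=1):
    """Rank segments by word overlap with the query.

    Assumption: this toy scorer replaces the LLM-based question answering
    that the real system performs over the document.
    """
    query_words = set(query.lower().split())

    def score(seg):
        text_words = set(f"{seg.caption} {seg.transcript}".lower().split())
        return len(query_words & text_words)

    return sorted(segments, key=score, reverse=True)[:k]


# Hard-coded example data; a real pipeline would derive these from the video.
segments = [
    Segment(0, 10, "a person cracks eggs into a bowl", "first we crack two eggs"),
    Segment(10, 20, "a person pours batter into a pan", "now pour the batter slowly"),
]

document = build_document(segments)
top = retrieve(segments, "when is the batter poured into the pan", k=1)

print(document)
print(f"best match: {top[0].start:.0f}s-{top[0].end:.0f}s")
```

Keeping a timestamp on every block means an answer can point back to a moment in the video, and because the document is plain text, any text-based retriever or LLM could in principle replace the toy scorer.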