zli12321/Vision-Language-Models-Overview
A curated survey repository tracking the evolution of vision-language models across three architectural eras.

This repository collects and surveys vision-language model papers and implementations. It documents the architectural progression across three eras: early frozen-encoder approaches, LLM-centric designs, and modern native multimodal transformers. The survey covers benchmarking methodologies, evaluation frameworks, reinforcement learning (RL) alignment techniques, and applications, spanning leading models including GPT-4V, Claude, Gemini, LLaVA, and Qwen-VL variants.