gokayfem/awesome-vlm-architectures
A curated collection documenting the architectures of well-known Vision-Language Models, including LLaVA, PaliGemma, and Janus-Pro.

This repository compiles detailed information on prominent Vision-Language Models (VLMs), documenting their multimodal architectures, training procedures, and the datasets used for pre-training and fine-tuning. It covers encoder fusion techniques, cross-attention mechanisms, and VLM families designed for visual understanding tasks such as Visual Question Answering and image captioning. The collection serves as a reference for researchers and developers exploring VLM architecture patterns.