Is Vision-Language-Models-Overview open source?

Yes — zli12321/Vision-Language-Models-Overview is an open-source project tracked on heatdrop.

What language is Vision-Language-Models-Overview written in?

zli12321/Vision-Language-Models-Overview is primarily written in HTML.

How popular is Vision-Language-Models-Overview?

zli12321/Vision-Language-Models-Overview has 673 stars on GitHub.

Where can I find Vision-Language-Models-Overview?

zli12321/Vision-Language-Models-Overview is on GitHub at https://github.com/zli12321/Vision-Language-Models-Overview.

zli12321/Vision-Language-Models-Overview

A field guide to the vision-language model explosion

A living survey that sorts the flood of vision-language research into architectures, benchmarks, and alignment methods.

673 stars · Language: HTML · Topics: Language Models, Learning

What it does

This repository is a curated academic survey and bibliography of large vision-language models, anchored by a CVPR 2025 workshop paper. It compiles state-of-the-art models, evaluation benchmarks, reinforcement-learning alignment techniques, and application areas (from robotics to medical VQA) into sortable tables and dated progressive reports. The maintainers also apply a three-era architectural taxonomy to organize what would otherwise be an overwhelming stream of arXiv preprints.

The interesting bit

Rather than simply listing papers, the README proposes a structural narrative: VLMs have moved from bridged encoder-decoder designs (Era 1) to LLM-backbone adapters (Era 2), and now to natively fused transformers that either output text only (Era 3a) or generate images and audio as well (Era 3b). That split between “multimodal-in, text-out” and true omni-modal I/O is a genuinely useful lens for comparing current flagships.
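The taxonomy described above can be read as a small decision rule. The following is only an illustrative sketch: the `Era` enum, the `ModelTraits` fields, and the `classify` function are hypothetical names invented here, not anything defined in the repository, and real models may not fit three boolean traits cleanly.

```python
from dataclasses import dataclass
from enum import Enum


class Era(Enum):
    """The four buckets named in the README's narrative (labels paraphrased)."""
    BRIDGED = "Era 1: bridged encoder-decoder"
    LLM_ADAPTER = "Era 2: LLM-backbone adapter"
    NATIVE_TEXT_OUT = "Era 3a: native fusion, text-only output"
    NATIVE_OMNI = "Era 3b: native fusion, omni-modal output"


@dataclass(frozen=True)
class ModelTraits:
    # Hypothetical simplification: these fields are assumptions for
    # illustration, not columns from the repository's tables.
    llm_backbone: bool     # built around a pretrained LLM
    native_fusion: bool    # modalities fused inside a single transformer
    non_text_output: bool  # can generate images or audio, not only text


def classify(traits: ModelTraits) -> Era:
    """Map coarse traits to an era, checking the newest design first."""
    if traits.native_fusion:
        # Era 3 splits on whether the model emits more than text.
        return Era.NATIVE_OMNI if traits.non_text_output else Era.NATIVE_TEXT_OUT
    if traits.llm_backbone:
        return Era.LLM_ADAPTER
    return Era.BRIDGED
```

Checking `native_fusion` before `llm_backbone` reflects the assumption that a natively fused model may still be built on an LLM, so the more specific trait should win.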

Key highlights

Living document: dated “progressive research reports” track new models, benchmarks, and post-training methods between major updates.
Broad coverage: tables span SoTA VLMs, training datasets, post-training methods (RL alignment such as GRPO, alongside supervised fine-tuning, SFT), and domain applications including embodied AI and autonomous driving.
Architectural taxonomy: the README explicitly categorizes models into Era 1 (bridged), Era 2 (LLM-centric adapter), and Era 3 (native multimodal fusion with text-only or omni-modal output).
Academic backing: the work is published as a CVPR 2025 workshop paper with a proper BibTeX citation block.
Community contributions: the maintainers welcome pull requests for surveys, perspectives, and datasets.

Caveats

This is a curation and taxonomy, not a framework: expect links and tables, not runnable code or reproducible training pipelines.
Many entries in the model comparison table list “Undisclosed” for parameters, vision encoders, or training data, reflecting industry opacity rather than curation gaps.
Progressive report timestamps in the README carry 2026 dates that sit awkwardly against the CVPR 2025 citation, so verify original publication dates before treating entries as historical fact.

Verdict

Bookmark this if you need a map of the VLM space without reading fifty abstracts a week. Skip it if you are hunting for model weights, unified inference APIs, or training scripts.
