yuewang-cuhk/awesome-vision-language-pretraining-papers

A curated map of the BERT-vision explosion

A hand-maintained index of 70+ papers tracing how transformers swallowed computer vision whole.

★1.2k stars Learning Language Models

What it does

This repo is a reading list: papers, arXiv links, and occasional code references for vision-language pretrained models (VL-PTMs) from 2019 through mid-2021. It covers image-based, video-based, and even speech-based variants, sorted into representation learning, task-specific work, and analysis.

The interesting bit

The curation itself is the artifact. You can trace the field’s evolution through it: from ViLBERT and LXMERT’s careful cross-modal fusion, to ViLT ditching convolutions entirely, to Florence claiming “foundation model” status before that term fully curdled. The maintainer also flags rough edges the community worried about: social bias, adversarial fragility, and whether all this pretraining is actually being done right.

Key highlights

~70 papers with direct arXiv/conference links, many with code
Covers image, video, and speech modalities plus “other transformer-based multimodal networks”
Explicit sections for critical analysis: bias, robustness, architecture search, multi-task unification
Last updated June 2021, so it captures the 2019–2021 pretraining boom up to the months just after CLIP’s release (January 2021)
Includes niche task-specific work (TextVQA, chart VQA, visual navigation) often missing from broad surveys

Caveats

Frozen in mid-2021; misses the later diffusion and LLM-native multimodal wave
No search, no tagging, no abstracts — pure hierarchical markdown
Some entries are just titles and links; quality of annotation varies
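
Since the list ships without search, one workaround is to index a local copy of the README yourself. The sketch below is only an illustration under stated assumptions: it assumes the README uses `#`-style headings and inline `[title](url)` links, which has not been checked against the actual file. The names `index_links` and `search`, and the sample text, are hypothetical and not part of the repository.

```python
import re

# Matches inline markdown links of the form [title](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def index_links(markdown_text):
    """Return (section, title, url) tuples, tagging each link with the nearest heading above it."""
    entries = []
    section = ""
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Remember the most recent heading as the current section.
            section = stripped.lstrip("#").strip()
            continue
        for title, url in LINK_RE.findall(line):
            entries.append((section, title, url))
    return entries


def search(entries, keyword):
    """Case-insensitive substring match against link titles and section names."""
    kw = keyword.lower()
    return [e for e in entries if kw in e[1].lower() or kw in e[0].lower()]


# Made-up sample standing in for the real README (assumed layout, placeholder URLs).
SAMPLE = """\
# Image-based VL-PTMs
## Representation learning
* [Example Paper A on VQA](https://example.org/a)
## Analysis
* [Example Paper B on bias](https://example.org/b), [code](https://example.org/b-code)
"""

if __name__ == "__main__":
    for section, title, url in search(index_links(SAMPLE), "bias"):
        print(f"{section}: {title} <{url}>")
```

To try it on the real list, read a downloaded copy of the README into a string and pass that to `index_links` in place of `SAMPLE`. The link pattern may need adjusting if the file formats its entries differently.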

Verdict

Useful if you’re tracing historical lineage or writing a literature review on the 2019–2021 transformer-vision convergence. Skip it if you need current SOTA or interactive filtering; this is a bibliography, not a database.

Frequently asked

What is yuewang-cuhk/awesome-vision-language-pretraining-papers?
A hand-maintained index of 70+ papers tracing how transformers swallowed computer vision whole.

Is awesome-vision-language-pretraining-papers open source?
Yes. yuewang-cuhk/awesome-vision-language-pretraining-papers is an open-source project tracked on heatdrop.

How popular is awesome-vision-language-pretraining-papers?
yuewang-cuhk/awesome-vision-language-pretraining-papers has 1.2k stars on GitHub.

Where can I find awesome-vision-language-pretraining-papers?
yuewang-cuhk/awesome-vision-language-pretraining-papers is on GitHub at https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers.