yuewang-cuhk/awesome-vision-language-pretraining-papers

A curated map of the BERT-vision explosion

A hand-maintained index of roughly 70 papers tracing how BERT-style transformer pretraining spread into vision-language modeling.

What it does This repo is a reading list: papers, arXiv links, and occasional code references for vision-language pretrained models (VL-PTMs) from 2019 through mid-2021. It covers image-based, video-based, and even speech-based variants, sorted into representation learning, task-specific work, and analysis.

The interesting bit The curation itself is the artifact. The list lets you trace the field’s evolution paper by paper: from ViLBERT and LXMERT’s careful cross-modal fusion, to ViLT ditching convolutions entirely, to Florence claiming “foundation model” status before that term fully curdled. The maintainer also flags the rough edges the community worried about: social bias, adversarial fragility, and whether all this pretraining is actually being done right.

Key highlights

  • ~70 papers with direct arXiv/conference links, many with code
  • Covers image, video, and speech modalities plus “other transformer-based multimodal networks”
  • Explicit sections for critical analysis: bias, robustness, architecture search, multi-task unification
  • Last updated June 2021, so it captures the field before CLIP-style models went mainstream
  • Includes niche task-specific work (TextVQA, chart VQA, visual navigation) often missing from broad surveys

Caveats

  • Frozen in mid-2021; misses the later diffusion and LLM-native multimodal wave
  • No search, no tagging, no abstracts — pure hierarchical markdown
  • Some entries are just titles and links; quality of annotation varies

Verdict Useful if you’re tracing historical lineage or writing a literature review on the 2019–2021 transformer-vision convergence. Skip it if you need current SOTA or interactive filtering; this is a bibliography, not a database.
