A curated map of the BERT-vision explosion
A hand-maintained index of 70+ papers tracing how transformers swallowed computer vision whole.

What it does
This repo is a reading list: papers, arXiv links, and occasional code references for vision-language pretrained models (VL-PTMs) from 2019 through mid-2021. It covers image-based, video-based, and even speech-based variants, sorted into representation learning, task-specific work, and analysis.
The interesting bit
The curation itself is the artifact. You can trace the field’s evolution through it: from ViLBERT and LXMERT’s careful cross-modal fusion, to ViLT ditching convolutions entirely, to Florence claiming “foundation model” status before that term fully curdled. The maintainer also flags rough edges the community worried about: social bias, adversarial fragility, and whether all this pretraining is actually being done right.
Key highlights
- ~70 papers with direct arXiv/conference links, many with code
- Covers image, video, and speech modalities plus “other transformer-based multimodal networks”
- Explicit sections for critical analysis: bias, robustness, architecture search, multi-task unification
- Last updated June 2021; captures the explosion before CLIP-style models went mainstream
- Includes niche task-specific work (TextVQA, chart VQA, visual navigation) often missing from broad surveys
Caveats
- Frozen in mid-2021; misses the later diffusion and LLM-native multimodal wave
- No search, no tagging, no abstracts — pure hierarchical markdown
- Some entries are just titles and links; quality of annotation varies
Verdict
Useful if you’re tracing historical lineage or writing a literature review on the 2019–2021 transformer-vision convergence. Skip it if you need current SOTA or interactive filtering; this is a bibliography, not a database.