dandelin/ViLT

Vision-and-language models that skip the CNN entirely

ViLT feeds image patches and text tokens into a single transformer, with no object detector and no ResNet backbone.

1.5k stars Python Other AI
Velocity (7d): +0.8 ★/day · Trend: steady

What it does

ViLT is a vision-and-language pre-training model that strips out the usual visual feature extraction pipeline. Instead of using a CNN or region-supervised object detector to preprocess images, it cuts each image into patches, flattens them into a sequence, and runs that sequence through the same transformer that handles the text tokens. The repo ships with pretrained weights, fine-tuned checkpoints for VQA and image retrieval, and two Gradio demos you can run locally.
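To make the patch idea concrete, here is a minimal sketch in plain Python, not the repo's code. It cuts a toy image into non-overlapping tiles and flattens each tile into one vector, which is the "flat sequence" the transformer would see alongside the text tokens. The names `patchify` and `joint_sequence_length` are hypothetical, the patch size of 32 is an assumption for illustration, and the sketch leaves out the learned projection and position embeddings a real model applies to each patch.

```python
from typing import List

# An image as nested lists: height x width x channels.
Image = List[List[List[float]]]


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping patch x patch tiles.

    Each tile is flattened (row-major) into a single vector, so the result
    is a sequence of vectors, one per tile.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def joint_sequence_length(num_text_tokens: int, image: Image, patch: int) -> int:
    """Length of the combined sequence: text tokens followed by image patches."""
    return num_text_tokens + len(patchify(image, patch))


# Toy 64 x 96 RGB image filled with zeros (assumed sizes, for illustration).
image = [[[0.0, 0.0, 0.0] for _ in range(96)] for _ in range(64)]
patches = patchify(image, patch=32)

print(len(patches))       # 6 tiles: 2 rows x 3 columns
print(len(patches[0]))    # 3072 values per tile: 32 * 32 * 3
print(joint_sequence_length(10, image, patch=32))  # 16
```

The point of the sketch is that nothing here depends on a pretrained visual backbone: the image side of the input is produced by slicing and reshaping alone.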

The interesting bit

The authors noticed that in prior vision-and-language models, extracting visual features alone consumed more compute than the actual multimodal reasoning. By going convolution-free, ViLT claims to be “tens of times faster” than earlier VLP models while staying competitive with, or better than, them on downstream tasks. It’s a bet that the transformer can learn visual structure from raw patches given enough data and scale.

Key highlights

  • Single transformer architecture for both image patches and text tokens
  • Five pretrained/fine-tuned checkpoints available via GitHub releases (VQA, NLVR2, COCO/F30K retrieval)
  • Ready-to-run Gradio demos for masked language modeling visualization and VQA
  • Training and evaluation pipelines documented in separate markdown files
  • ICML 2021 long talk; code and weights are explicitly released for reuse

Caveats

  • The “tens of times faster” claim comes from the paper abstract; no explicit benchmark numbers are shown in the README itself
  • Demos pin Gradio to version 1.6.4, a very old release; expect to either install that exact version or adapt the demo code to a current Gradio
  • Dataset preparation and training details are offloaded to separate docs, so the quick-start path is inference-only
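Because the demos depend on a specific Gradio version, it can help to check what is installed before launching them. The snippet below is a small sketch using only the standard library; `pin_status` is a hypothetical helper, and the pinned version string is taken from the caveat above rather than read from the repo's requirements.

```python
from importlib import metadata

# Version the demos are said to pin; an assumption taken from the caveat above.
PINNED_GRADIO = "1.6.4"


def pin_status(package: str, pinned: str) -> str:
    """Return 'missing', 'match', or 'mismatch' for an installed package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "match" if installed == pinned else "mismatch"


if __name__ == "__main__":
    print(f"gradio pin check: {pin_status('gradio', PINNED_GRADIO)}")
```

A result of `mismatch` does not necessarily mean the demos will fail, only that the environment differs from the one they were written against.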

Verdict

Worth a look if you’re building vision-and-language systems and want to escape the object-detector-and-CNN tax. Less useful if you need the absolute state of the art: this is a 2021 baseline with architectural conviction, not a current leaderboard topper.
