X-ray vision for multimodal transformers
A single method to visualize attention in any bi-modal or encoder-decoder Transformer, no retraining required.

What it does
This is the official PyTorch implementation of an ICCV 2021 Oral paper. It produces attention-based explanations for Transformer models that handle multiple input types—vision+language (VQA), image+text matching (CLIP), or encoder-decoder setups like DETR. The repo ships with ready-to-run Colab notebooks for LXMERT, DETR, CLIP, and plain ViT, plus scripts to reproduce the paper’s perturbation experiments.
The interesting bit
The method is generic—the same core approach works across architectures without architecture-specific retraining. That’s unusual in explainability, where techniques tend to be model-specific hacks. The authors achieve this by operating on the attention mechanism itself, making it applicable anywhere standard multi-head attention lives.
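To make "operating on the attention mechanism itself" concrete, here is a minimal sketch of the kind of relevance update the paper describes for a self-attention layer. It rests on assumptions: the rule is taken to be "average the positive part of gradient times attention over heads, then add its product with the running relevance map". Plain Python lists stand in for tensors so it runs without dependencies, and every function name is hypothetical rather than the repo's API. The cross-attention rules for bi-modal models are left out.

```python
# Simplified sketch, not the repo's code: one self-attention relevance update.
# attn and grad are nested lists shaped [heads][tokens][tokens]; in practice
# they would come from a forward pass and a backward pass of the model.

def head_averaged_relevance(attn, grad):
    """Average max(grad * attn, 0) over heads, giving a [tokens][tokens] map."""
    heads, n = len(attn), len(attn[0])
    return [
        [
            sum(max(grad[h][i][j] * attn[h][i][j], 0.0) for h in range(heads)) / heads
            for j in range(n)
        ]
        for i in range(n)
    ]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def identity(n):
    """Initial relevance: each token is relevant only to itself."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def update_relevance(relevance, attn, grad):
    """Assumed update rule for one layer: R <- R + A_bar @ R."""
    a_bar = head_averaged_relevance(attn, grad)
    delta = matmul(a_bar, relevance)
    return [
        [r + d for r, d in zip(r_row, d_row)]
        for r_row, d_row in zip(relevance, delta)
    ]

# Toy numbers: 2 heads, 2 tokens (made up for illustration).
attn = [[[0.5, 0.5], [0.2, 0.8]], [[1.0, 0.0], [0.4, 0.6]]]
grad = [[[1.0, -1.0], [0.0, 2.0]], [[-1.0, 1.0], [1.0, 0.0]]]
relevance = update_relevance(identity(2), attn, grad)
```

Nothing here depends on what the model was trained for, only on attention maps and their gradients, which is why the same loop can in principle be applied layer by layer to any pretrained model that exposes them.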
Key highlights
- Ready Colab notebooks for LXMERT, DETR, CLIP, and ViT (GPU required)
- Reproduction scripts for VisualBERT, LXMERT, and DETR with exact command-line incantations
- Hugging Face Spaces demo for CLIP grounding (built by external contributors)
- Perturbation-based evaluation protocol included, not just pretty heatmaps
- Works on pretrained models as-is; no fine-tuning for interpretability
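The perturbation-based evaluation mentioned above can be illustrated with a toy sketch. The assumption is that the protocol removes input tokens in order of explained relevance and tracks how the model's score changes. The scoring function, weights, and names below are hypothetical stand-ins, not the repo's scripts or datasets.

```python
# Toy sketch of a perturbation test, assuming "remove the most relevant
# tokens first and watch the score fall". Not the repo's evaluation code.

def perturbation_curve(score_fn, relevance, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the model score after removing the top fraction of tokens."""
    n = len(relevance)
    # Token indices sorted from most to least relevant.
    order = sorted(range(n), key=lambda i: relevance[i], reverse=True)
    curve = []
    for frac in fractions:
        removed = set(order[: round(frac * n)])
        keep = [i not in removed for i in range(n)]
        curve.append(score_fn(keep))
    return curve

# Hypothetical "model": its score is the summed weight of the tokens it still sees.
weights = [0.1, 0.6, 0.05, 0.25]

def toy_score(keep):
    return sum(w for w, k in zip(weights, keep) if k)

# A faithful explanation ranks tokens by their true weight;
# a misleading one ranks them in the opposite order.
good = perturbation_curve(toy_score, relevance=weights)
bad = perturbation_curve(toy_score, relevance=[-w for w in weights])
```

With the faithful ranking the score collapses after the first removal, while the misleading ranking keeps it high until the end. Comparing the area under the two curves is the basic idea behind scoring explanations this way.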
Caveats
- Reproduction setup is involved: you need to manually patch cocoeval.py for DETR and wrangle multiple dataset downloads
- The README warns that requirement installation “may take some time” and requires a runtime restart
- VisualBERT depends on the somewhat heavy MMF framework
Verdict
Worth a look if you’re building or auditing multimodal systems and need more than “trust us, it works.” Skip if you just want a drop-in .explain() method—this is research code with research-code edges.