hila-chefer/Transformer-MM-Explainability

X-ray vision for multimodal transformers

A single method to visualize attention in any bi-modal or encoder-decoder Transformer, no retraining required.

908 stars · Jupyter Notebook · LLMOps · Eval · Computer Vision
Velocity (7d): +0.5 ★ / day · Trend: steady

What it does

This is the official PyTorch implementation of an ICCV 2021 Oral paper. It produces attention-based explanations for two kinds of Transformer: bi-modal models, such as vision-and-language models for VQA or image-text matching with CLIP, and encoder-decoder models, such as DETR. The repo ships with ready-to-run Colab notebooks for LXMERT, DETR, CLIP, and plain ViT, plus scripts to reproduce the paper’s perturbation experiments.

The interesting bit

The method is generic: the same core approach works across architectures, and it runs on pretrained models without any retraining. That’s unusual in explainability, where techniques are often tied to a single model family. The authors achieve this by operating on the attention mechanism itself, so the method applies wherever standard multi-head attention is used.
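To make "operating on the attention mechanism itself" concrete, here is a minimal sketch of a gradient-weighted attention relevance update for a self-attention-only model. It assumes an update of the form R ← R + mean over heads of max(∇A ⊙ A, 0) · R, starting from the identity matrix; that is one reading of the approach, not the repository's code. The function names and toy numbers are hypothetical, and the extra rules a bi-modal or encoder-decoder model would need for cross-attention are left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def head_averaged_map(attn, grad):
    """Average max(grad * attn, 0) over heads.

    attn and grad are nested lists shaped [heads][tokens][tokens].
    Clamping at zero keeps only positively contributing attention entries.
    """
    heads, n = len(attn), len(attn[0])
    return [[sum(max(grad[h][i][j] * attn[h][i][j], 0.0)
                 for h in range(heads)) / heads
             for j in range(n)]
            for i in range(n)]


def propagate_relevance(layers):
    """Accumulate token-to-token relevance across layers.

    layers is a list of (attn, grad) pairs, ordered from first layer to last.
    Relevance starts as the identity: each token explains only itself.
    """
    n = len(layers[0][0][0])
    relevance = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn, grad in layers:
        a_bar = head_averaged_map(attn, grad)
        update = matmul(a_bar, relevance)
        relevance = [[relevance[i][j] + update[i][j] for j in range(n)]
                     for i in range(n)]
    return relevance


# Toy input: one layer, one head, two tokens (made-up numbers).
toy_attn = [[[0.5, 0.5], [0.2, 0.8]]]
toy_grad = [[[1.0, -1.0], [0.0, 2.0]]]
toy_relevance = propagate_relevance([(toy_attn, toy_grad)])
# approximately [[1.5, 0.0], [0.0, 2.6]]
```

In a real model the attention maps and their gradients would come from hooks on each attention layer after a backward pass on the target output. Nothing in the update depends on the architecture beyond having those two tensors, which is why a rule of this shape carries over between models.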

Key highlights

  • Ready Colab notebooks for LXMERT, DETR, CLIP, and ViT (GPU required)
  • Reproduction scripts for VisualBERT, LXMERT, and DETR with exact command-line incantations
  • Hugging Face Spaces demo for CLIP grounding (built by external contributors)
  • Perturbation-based evaluation protocol included, not just pretty heatmaps
  • Works on pretrained models as-is; no fine-tuning for interpretability
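For readers unfamiliar with perturbation-based evaluation, the sketch below shows one common form of the test: remove input tokens in order of relevance and track how the model's score changes. It is illustrative only. The names `perturbation_curve`, `area_under_curve`, and `toy_score` are hypothetical, and the toy scoring function stands in for a real model's confidence; the repository's own scripts may differ in the details.

```python
def perturbation_curve(tokens, relevance, score_fn,
                       fractions=(0.0, 0.25, 0.5, 0.75, 1.0),
                       most_relevant_first=True):
    """Return (fraction_removed, score) pairs.

    Tokens are removed in order of relevance. Removing the most relevant
    tokens first should hurt the score quickly if the explanation is good;
    removing the least relevant first should hurt it slowly.
    """
    order = sorted(range(len(tokens)), key=lambda i: relevance[i],
                   reverse=most_relevant_first)
    curve = []
    for frac in fractions:
        k = round(frac * len(tokens))          # number of tokens to drop
        removed = set(order[:k])
        kept = [t for i, t in enumerate(tokens) if i not in removed]
        curve.append((frac, score_fn(kept)))
    return curve


def area_under_curve(curve):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Toy stand-in for a model: the score is the summed weight of kept tokens.
TOY_WEIGHTS = {"a": 0.1, "b": 0.6, "c": 0.2, "d": 0.1}


def toy_score(kept_tokens):
    return sum(TOY_WEIGHTS[t] for t in kept_tokens)


toy_tokens = ["a", "b", "c", "d"]
toy_token_relevance = [0.10, 0.70, 0.15, 0.05]   # made-up explanation scores

positive = perturbation_curve(toy_tokens, toy_token_relevance, toy_score,
                              most_relevant_first=True)
negative = perturbation_curve(toy_tokens, toy_token_relevance, toy_score,
                              most_relevant_first=False)
```

A lower area under the most-relevant-first curve and a higher area under the least-relevant-first curve both indicate that the relevance ranking matches what the model actually relies on. This gives a number to compare explanation methods with, which a heatmap alone does not.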

Caveats

  • Reproduction setup is involved: you need to manually patch cocoeval.py for DETR and wrangle multiple dataset downloads
  • The README warns that requirement installation “may take some time” and requires a runtime restart
  • VisualBERT depends on the somewhat heavy MMF framework

Verdict

Worth a look if you’re building or auditing multimodal systems and need more than “trust us, it works.” Skip if you just want a drop-in .explain() method—this is research code with research-code edges.
