Zero-shot video editing by stealing your own attention maps
FateZero edits real-world videos with text prompts using pretrained diffusion models, no per-video training required.

What it does
FateZero performs text-driven edits on real videos (style transfers, attribute swaps, even shape changes) using a pretrained Stable Diffusion model; shape editing additionally relies on Tune-A-Video checkpoints. You provide a source video and a text prompt; it returns an edited version with (claimed) temporal consistency. No retraining, no manual masks.
The interesting bit
The trick is attention-map recycling. During DDIM inversion, FateZero captures intermediate self- and cross-attention maps, then fuses them back in during denoising to preserve structure and motion. It also blends self-attention maps using a mask derived from cross-attention features, which limits how much of the source video leaks into the edited regions. A spatial-temporal attention tweak in the UNet is meant to keep frames from drifting apart.
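To make the mask-based blending concrete, here is a minimal sketch in plain Python. It is not FateZero's code: the function name, the `threshold` value, and the list-of-lists layout are assumptions chosen for illustration, and a real implementation would operate on per-layer, per-timestep tensors. The idea shown is only that pixels whose cross-attention to the edited word exceeds a threshold take their self-attention row from the editing pass, while all other pixels reuse the row stored during inversion.

```python
def blend_self_attention(src_self_attn, edit_self_attn, src_cross_attn, threshold=0.3):
    """Blend two self-attention maps row by row using a cross-attention mask.

    src_self_attn:  rows of self-attention stored during inversion (one row per pixel).
    edit_self_attn: rows of self-attention computed during editing.
    src_cross_attn: one cross-attention weight per pixel for the edited word.
    threshold:      hypothetical cutoff used to binarize the mask.
    """
    fused = []
    for src_row, edit_row, weight in zip(src_self_attn, edit_self_attn, src_cross_attn):
        # Mask = 1 where the pixel attends strongly to the edited word.
        inside_edit_region = weight > threshold
        # Inside the mask keep the edited attention; outside reuse the source attention.
        fused.append(list(edit_row) if inside_edit_region else list(src_row))
    return fused


# Toy example with 3 "pixels"; every row sums to 1 like a softmax output.
src_self = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
edit_self = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
cross = [0.9, 0.05, 0.6]  # pixels 0 and 2 belong to the edited object

fused = blend_self_attention(src_self, edit_self, cross)
```

In this toy run, rows 0 and 2 come from the editing pass and row 1 is copied from the source, so the background pixel keeps its original structure.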
Key highlights
- Three editing modes: style transfer, local attribute editing (e.g., “squirrel, carrot → rabbit, eggplant”), and shape editing via Tune-A-Video checkpoints
- Zero-shot: no per-prompt training, no user-provided masks
- Ships with Colab notebook and Hugging Face Space for quick experiments
- Low-resource configs available for 16 GB GPUs; the default configuration is quoted at ~100 GB CPU RAM and 12 GB GPU memory for 8 frames on a 3090
- ICCV 2023 Oral; code and data released for paper reproduction
Caveats
- Memory appetite is real: default settings want ~100 GB of CPU RAM, and the “low-cost” config still needs a 16 GB GPU
- Shape editing requires separate Tune-A-Video checkpoints (~10 GB downloads)
- Full data + checkpoints run to ~100 GB; setup involves conda, xformers (noted as “not stable”), and manual model placement
- Todo list still has “time & memory optimization” unchecked
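A back-of-envelope calculation suggests why caching attention maps gets expensive. The sketch below assumes every self-attention map is kept for every denoising step and layer; the frame, head, token, step, and layer counts are made-up illustrative values, not FateZero's actual configuration, and the result is not meant to reproduce the ~100 GB figure.

```python
def attention_cache_bytes(frames, heads, tokens, steps, layers, bytes_per_value=2):
    """Estimate the memory needed to keep every self-attention map.

    Each map is tokens x tokens per head and per frame; one copy is kept
    for every cached layer at every diffusion step.
    """
    per_layer_per_step = frames * heads * tokens * tokens * bytes_per_value
    return per_layer_per_step * steps * layers


# Hypothetical setup: 8 frames, 8 heads, 32x32 latent tokens,
# 50 DDIM steps, 10 cached layers, fp16 values (2 bytes each).
total = attention_cache_bytes(frames=8, heads=8, tokens=32 * 32, steps=50, layers=10)
print(f"{total / 1024**3:.1f} GiB")
```

With these assumed numbers the cache alone comes to about 62.5 GiB. Because the cost grows with the square of the token count, storing maps at higher resolutions adds up quickly.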
Verdict
Worth a look if you’re researching diffusion-based video editing or need a zero-shot baseline to beat. Practitioners should budget for hardware and patience: this is research code with research-code ergonomics, not a product.