Erase moving objects from video with a transformer that sees space and time
A 2020 ECCV paper that treats video inpainting as a joint spatial-temporal attention problem, not frame-by-frame patchwork.

What it does
STTN fills missing regions in videos (think removing a pedestrian who walks through your shot) by attending to patches across both space and time simultaneously. It uses multi-scale patch-based attention modules and a spatial-temporal adversarial loss, trained on standard datasets like YouTube-VOS and DAVIS.
The interesting bit
Instead of the usual frame-by-frame or purely spatial approaches, STTN processes all input frames at once with joint spatial-temporal transformers. The attention visualization notebook suggests you can inspect where the model is “looking” across the video to borrow pixels for the hole.
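To make "joint spatial-temporal attention" concrete, here is a minimal NumPy sketch. It is not STTN's code: the function names are hypothetical, and it assumes identity query/key/value projections and a single head, where the real model learns its projections (see the paper for the actual architecture). It cuts every frame into patches, pools the patches from all frames into one token set, and lets each patch attend to every other patch regardless of which frame it came from.

```python
import numpy as np

def extract_patches(frames, patch):
    """Split (T, H, W, C) frames into (T * H/p * W/p, p*p*C) flattened patches."""
    t, h, w, c = frames.shape
    assert h % patch == 0 and w % patch == 0, "patch size must divide H and W"
    x = frames.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (T, H/p, W/p, p, p, C)
    return x.reshape(t * (h // patch) * (w // patch), patch * patch * c)

def joint_attention(frames, patch):
    """Softmax attention over patches from ALL frames at once.

    Assumption: queries, keys and values are the raw patches themselves;
    a trained model would use learned projections instead.
    """
    tokens = extract_patches(frames, patch)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights, weights @ tokens

# Toy video: 4 frames of 8x8 RGB noise.
rng = np.random.default_rng(0)
video = rng.standard_normal((4, 8, 8, 3))

# "Multi-scale" here just means repeating the attention at several patch sizes.
for p in (2, 4):
    weights, out = joint_attention(video, p)
    print(p, weights.shape, out.shape)

# Attention map of the first patch, laid out as (frame, patch row, patch column).
weights, _ = joint_attention(video, 2)
attention_map = weights[0].reshape(4, 4, 4)
```

Reshaping one row of the weight matrix back to (frame, row, column), as in the last line, is the kind of view an attention visualization gives you: it shows which patches in which frames a given location draws from.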
Key highlights
- Pretrained model available via Google Drive; one-line inference with test.py
- Supports both stationary masks and moving-object masks (the harder, realistic case)
- Includes TensorBoard training monitoring and a Jupyter notebook for attention visualization
- ECCV 2020 paper with slides and project page still live
- Conda environment file provided for reproducible setup
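The difference between the two mask types in the list above is easy to see with toy data. The sketch below is an illustration only: the helper names are hypothetical, and it assumes masks are per-frame boolean arrays, which may not match how the repo stores them. A stationary mask repeats one hole on every frame, while a moving mask shifts the hole from frame to frame, standing in for an object mask.

```python
import numpy as np

def stationary_masks(num_frames, height, width, box):
    """Same rectangular hole on every frame. box = (top, left, h, w)."""
    top, left, h, w = box
    masks = np.zeros((num_frames, height, width), dtype=bool)
    masks[:, top:top + h, left:left + w] = True
    return masks

def moving_masks(num_frames, height, width, box, step):
    """Hole shifted `step` pixels to the right on each frame (clamped to the frame)."""
    top, left, h, w = box
    masks = np.zeros((num_frames, height, width), dtype=bool)
    for t in range(num_frames):
        l = min(left + t * step, width - w)
        masks[t, top:top + h, l:l + w] = True
    return masks

stat = stationary_masks(5, 16, 16, box=(4, 4, 4, 4))
mov = moving_masks(5, 16, 16, box=(4, 0, 4, 4), step=3)

# Pixels hidden in every single frame of the clip.
always_hidden_stat = int(stat.all(axis=0).sum())
always_hidden_mov = int(mov.all(axis=0).sum())
print(always_hidden_stat, always_hidden_mov)
```

In this toy setup the stationary hole hides the same pixels in every frame, while the moving hole hides different pixels over time, which is one way the two cases pose different problems for a model that borrows content across frames.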
Caveats
- The README is sparse on architecture details; you’ll need the paper for the full method
- No explicit performance numbers or comparison tables in the repo itself
- Inference examples are limited to a single schoolgirls demo video
Verdict
Worth a look if you’re doing video restoration, object removal, or need a baseline transformer for spatiotemporal tasks. Skip if you need a polished production tool: this is research code with the usual rough edges.