researchmm/STTN

Erase moving objects from video with a transformer that sees space and time

A 2020 ECCV paper that treats video inpainting as a joint spatial-temporal attention problem, not frame-by-frame patchwork.

549 stars · Jupyter Notebook · Image · Video · Audio
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

STTN fills missing regions in videos (think removing a pedestrian who walks through your shot) by attending to patches across both space and time simultaneously. It uses multi-scale patch-based attention modules and a spatial-temporal adversarial loss, trained on standard datasets like YouTube-VOS and DAVIS.

The interesting bit

Instead of the usual frame-by-frame or purely spatial approaches, STTN processes all input frames at once with joint spatial-temporal transformers. The attention visualization notebook suggests you can actually inspect where the model is “looking” across the video to borrow pixels for the hole.
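
To make "joint spatial-temporal attention" concrete, here is a toy sketch in plain Python. It is not STTN's code: the function names, the tuple layout, and the hand-picked query vector are all assumptions for illustration, and it uses a single scale and a single head where the real model is multi-scale and learned. The point is only that the keys are visible patches from every frame, so a hole in one frame can borrow from other frames, and that the resulting weights are what an attention visualization would show.

```python
import math


def joint_attention(query, patches):
    """Attend from one hole query to the visible patches of ALL frames.

    patches: list of (frame, position, feature, visible) tuples (hypothetical layout).
    Returns one softmax weight per patch; hidden patches get weight 0.0.
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score for visible patches, None for hidden ones.
    scores = [
        sum(q * k for q, k in zip(query, feat)) / scale if visible else None
        for _, _, feat, visible in patches
    ]
    peak = max(s for s in scores if s is not None)  # for numerical stability
    exps = [0.0 if s is None else math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fill(weights, patches):
    """Weighted average of patch features: the toy 'inpainted' feature."""
    dim = len(patches[0][2])
    return [sum(w * p[2][d] for w, p in zip(weights, patches)) for d in range(dim)]


# Toy clip: 3 frames x 2 positions, 2-d features. Frame 1, position 0 is the hole.
patches = [
    (0, 0, [1.0, 0.0], True),
    (0, 1, [0.0, 1.0], True),
    (1, 0, [0.0, 0.0], False),  # masked out, never used as a key
    (1, 1, [0.0, 1.0], True),
    (2, 0, [1.0, 0.0], True),
    (2, 1, [0.0, 1.0], True),
]
query = [4.0, 0.0]  # assumed context-derived query for the hole

weights = joint_attention(query, patches)
filled = fill(weights, patches)

# Inspect where the toy model is "looking".
for (frame, pos, _, _), w in zip(patches, weights):
    print(f"frame {frame} pos {pos}: weight {w:.3f}")
```

In this toy setup the largest weights fall on position 0 of frames 0 and 2, which is the cross-frame borrowing a per-frame method cannot do.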

Key highlights

  • Pretrained model available via Google Drive; one-line inference with test.py
  • Supports both stationary masks and moving-object masks (the harder, realistic case)
  • Includes TensorBoard training monitoring and a Jupyter notebook for attention visualization
  • ECCV 2020 paper with slides and project page still live
  • Conda environment file provided for reproducible setup
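
The difference between stationary and moving-object masks can be sketched with a few lines of plain Python. This is an illustration only: the helper names and the nested-list mask layout are hypothetical and do not reflect the mask format the repo expects. A stationary mask hides the same pixels in every frame, while a moving mask hides a different region per frame as it follows the object.

```python
def rect_mask(height, width, top, left, size):
    """Binary mask as nested lists: 1 marks pixels to inpaint, 0 pixels to keep."""
    return [
        [1 if top <= y < top + size and left <= x < left + size else 0
         for x in range(width)]
        for y in range(height)
    ]


def stationary_masks(num_frames, height, width, top, left, size):
    """The same square hole in every frame."""
    return [rect_mask(height, width, top, left, size) for _ in range(num_frames)]


def moving_masks(num_frames, height, width, top, left, size, step):
    """A square hole that shifts right by `step` pixels per frame."""
    return [
        rect_mask(height, width, top, left + t * step, size)
        for t in range(num_frames)
    ]


def always_hidden(masks):
    """Count pixel coordinates that are masked in every frame."""
    height, width = len(masks[0]), len(masks[0][0])
    return sum(
        all(m[y][x] for m in masks)
        for y in range(height)
        for x in range(width)
    )


still = stationary_masks(num_frames=3, height=4, width=8, top=1, left=0, size=2)
moving = moving_masks(num_frames=3, height=4, width=8, top=1, left=0, size=2, step=2)

print("always hidden (stationary):", always_hidden(still))
print("always hidden (moving):", always_hidden(moving))
```

The count only describes which pixel coordinates are covered. It says nothing about how hard each case is for the model, since scene and camera motion also matter.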

Caveats

  • The README is sparse on architecture details; you’ll need the paper for the full method
  • No explicit performance numbers or comparison tables in the repo itself
  • Inference examples are limited to a single schoolgirls demo video

Verdict

Worth a look if you’re doing video restoration, object removal, or need a baseline transformer for spatiotemporal tasks. Skip if you need a polished production tool: this is research code with the usual rough edges.
