Video editing by cheating at diffusion's hidden layer
TokenFlow keeps video frames consistent by propagating features inside Stable Diffusion instead of fixing pixels after the fact.

What it does
TokenFlow takes a source video and a text prompt, then re-renders the video to match the prompt while keeping the original layout and motion intact. It rides on top of existing image editing tools (Plug-and-Play, ControlNet, SDEdit) rather than replacing them. No training, no fine-tuning.
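The interesting bit
The trick is enforcing consistency in the diffusion model’s feature space, not in pixel space. The authors edit a handful of keyframes, then propagate the edited diffusion features to every other frame using inter-frame correspondences found by nearest-neighbor matching on the source video’s own diffusion features. No optical-flow model, no extra network: it is the kind of “use what’s already there” move that sidesteps training altogether.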
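To make the propagation idea concrete, here is a minimal NumPy sketch, not the repository's code. It assumes each frame is a flat array of tokens (one feature vector per spatial location), matches tokens by cosine similarity, and copies edited tokens from a single keyframe. The function names and toy arrays are hypothetical; the actual method blends features from the two nearest keyframes and operates inside Stable Diffusion's attention layers.

```python
import numpy as np

def nn_correspondence(src_frame, src_key):
    """For each token in src_frame, return the index of the most
    cosine-similar token in src_key. Both inputs are (tokens, dim)."""
    a = src_frame / np.linalg.norm(src_frame, axis=1, keepdims=True)
    b = src_key / np.linalg.norm(src_key, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)

def propagate_tokens(src_frame, src_key, edited_key):
    """Correspondences come from the SOURCE features; the values that
    get copied come from the EDITED keyframe features."""
    idx = nn_correspondence(src_frame, src_key)
    return edited_key[idx]

# Toy data (hypothetical): 3 tokens with 3-dim source features.
src_key = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
# The same content, shifted to different token positions in another frame.
src_frame = np.array([[0.1, 0.9, 0.0],
                      [0.0, 0.2, 1.0],
                      [1.0, 0.1, 0.0]])
# Edited keyframe features (2-dim here, just to show shapes may differ).
edited_key = np.array([[10.0, 0.0],
                       [20.0, 0.0],
                       [30.0, 0.0]])

idx = nn_correspondence(src_frame, src_key)
out = propagate_tokens(src_frame, src_key, edited_key)
print(idx)  # [1 2 0]
print(out)
```

Because the matching is done on the unedited video, the edit inherits the source's motion: a token that moved between frames pulls its edited counterpart along with it.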
Key highlights
- Built on Stable Diffusion; works with SD 1.x/2.x pipelines
- Preprocessing inverts the video to latent space; editing runs through YAML configs
- Supports structure-preserving edits: texture swaps, scene augmentations, semi-transparent effects
- ICLR 2024; Hugging Face demo available
- ~1.7k GitHub stars, enough to show it has caught researchers’ attention
Caveats
- Requires a “good reconstruction” in preprocessing or editing fails; the README is vague on what “good” means quantitatively
- The LDM decoder can introduce frame-to-frame jitter, depending on the source video
- Only structure-preserving edits; don’t expect full object replacement or motion changes
- The “more information on arguments” link in preprocessing is a dead reference (“found here” with no URL)
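Since the README gives no number for a “good reconstruction”, one way to put a figure on it is to compare the source frames against the frames reconstructed after inversion. The sketch below uses per-frame PSNR and reports the worst frame. The metric choice, the function names, and the toy data are assumptions for illustration; nothing here comes from the TokenFlow repository, and any pass/fail threshold would have to be found empirically.

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def worst_frame_psnr(source, recon):
    """Lowest per-frame PSNR across a video shaped (frames, H, W, C)."""
    return min(psnr(s, r) for s, r in zip(source, recon))

# Toy data (hypothetical): two flat gray 4x4 RGB frames.
source = np.full((2, 4, 4, 3), 100, dtype=np.uint8)
recon = source.copy()
recon[1] += 5  # second frame is off by 5 intensity levels everywhere

score = worst_frame_psnr(source, recon)
print(round(score, 2))
```

Reporting the minimum rather than the mean is deliberate: a single badly reconstructed frame is enough to produce a visible glitch in the edited video.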
Verdict
Worth a look if you’re already doing diffusion-based image editing and need to batch it across video frames without retraining. Skip if you need heavy motion editing or can’t tolerate occasional decoder artifacts.