OpenImagingLab/FlashVSR

Diffusion video upscaling that finally stops for breath

A one-step diffusion model that streams 768×1408 video at ~17 FPS on a single A100 by saying no to redundant attention.

Velocity (7d): +6.9 ★/day · Trend: steady

What it does

FlashVSR upscales low-resolution video in real time using a distilled one-step diffusion pipeline. It targets 4× super-resolution and processes frames as a stream rather than chewing through entire clips offline. The authors claim ~17 FPS at 768×1408 on one A100, a roughly 12× speedup over prior one-step diffusion VSR approaches.
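To put the claimed numbers in perspective, here is a back-of-the-envelope sketch in Python. It assumes that 768×1408 is the output resolution, that the 4× factor applies to each side, and that the 12× speedup is measured in frames per second; none of these are spelled out above, and the function name `stream_budget` is purely illustrative.

```python
def stream_budget(out_h, out_w, fps, scale, speedup):
    """Derive per-frame budgets from the headline figures.

    Assumptions (not confirmed by the source): out_h x out_w is the
    OUTPUT size, `scale` applies per side, `speedup` is an FPS ratio.
    """
    return {
        # Time available to produce each output frame, in milliseconds.
        "ms_per_frame": 1000.0 / fps,
        # Implied low-resolution input size under the per-side assumption.
        "input_size": (out_h // scale, out_w // scale),
        # Output pixels produced per second, in megapixels.
        "output_megapixels_per_s": out_h * out_w * fps / 1e6,
        # Implied frame rate of the slower baseline.
        "baseline_fps": fps / speedup,
    }


budget = stream_budget(out_h=768, out_w=1408, fps=17, scale=4, speedup=12)
for name, value in budget.items():
    print(name, value)
```

Under those assumptions the model has just under 60 ms per frame, the input would be 192×352, and the implied baseline runs at well under 2 FPS, which is the gap between "streaming" and "offline".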

The interesting bit

The speed comes from three deliberate design choices: a three-stage distillation that collapses the usual multi-step diffusion into a single pass, locality-constrained sparse attention that restricts each query to nearby regions instead of attending over every token, and a stripped-down conditional decoder. The sparse attention is particularly notable: third-party ComfyUI ports that omit it and fall back to dense attention visibly degrade quality, which the authors document with side-by-side examples.
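As a rough illustration of why a locality constraint saves compute, the sketch below builds a block-level mask in which each query block may only attend to key blocks within a fixed distance in time and space, then reports how many block pairs survive compared with dense attention. This is not FlashVSR's implementation: the grid sizes, the radius, and the function names `local_block_mask` and `density` are all hypothetical, and the real module may select blocks differently.

```python
from itertools import product


def local_block_mask(grid, radius):
    """Return (blocks, kept) for a local-window block mask.

    grid:   (frames, rows, cols) counted in attention blocks (hypothetical)
    radius: (rt, rr, rc) maximum allowed block distance along each axis
    kept:   set of (query_block, key_block) pairs that remain active
    """
    blocks = list(product(*(range(n) for n in grid)))
    kept = set()
    for q in blocks:
        for k in blocks:
            # Keep the pair only if it is close along every axis.
            if all(abs(qi - ki) <= r for qi, ki, r in zip(q, k, radius)):
                kept.add((q, k))
    return blocks, kept


def density(grid, radius):
    """Fraction of block pairs kept relative to dense attention."""
    blocks, kept = local_block_mask(grid, radius)
    return len(kept) / (len(blocks) ** 2)


# Toy example: 2 frames of 6x8 blocks, each block sees its immediate neighbours.
print(f"{density((2, 6, 8), (1, 1, 1)):.3f}")  # prints 0.153
```

In this toy setting only about 15% of the block pairs are computed, and the fraction shrinks further as the grid grows while the radius stays fixed. Because the window is defined in blocks rather than as a fraction of the frame, a constraint of this kind also keeps each query's receptive field the same size at any resolution, which is one plausible reading of the "train/test resolution gap" point in the highlights.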

Key highlights

  • One-step streaming diffusion for video SR, not the usual offline batch processing
  • Locality-constrained sparse attention cuts compute and bridges train/test resolution gaps
  • Tiny conditional decoder keeps reconstruction fast without (they claim) sacrificing quality
  • VSR-120K dataset: 120k videos + 180k images for joint training, though not yet released
  • v1.1 weights available on Hugging Face with “enhanced stability + fidelity”
  • Active third-party ecosystem: multiple ComfyUI nodes, cloud APIs, though quality varies

Caveats

  • Block-sparse attention compilation is memory-hungry and officially tested only on A100/A800; H200 works but with limited acceleration, and RTX 40/50 series compatibility is unknown
  • The project is “primarily designed and optimized for 4× SR”; other scales are second-class citizens
  • VSR-120K dataset is still unreleased, so replication of training from scratch is currently impossible
  • Third-party implementations (including some popular ComfyUI nodes) have shipped without the sparse attention module, producing visibly worse results

Verdict

Worth a look if you need diffusion-level video upscaling without the usual multi-step sampling tax. Skip it if you're on consumer hardware or need flexible scaling factors: official testing covers only A100/A800, RTX 40/50 series compatibility is unknown, and the 4× focus is a real constraint, not a suggestion.
