8.3B parameters, 720p video, one RTX 4090
Tencent's open video generator shrinks the hardware barrier without shrinking ambition.

What it does
HunyuanVideo-1.5 generates video from text prompts or still images, natively at up to 720p with a super-resolution stage for 1080p output. It is built around an 8.3B-parameter diffusion transformer and a 3D causal VAE. The repo ships inference code, model weights, training scripts, and integrations for ComfyUI, Diffusers, and LightX2V. A step-distilled image-to-video variant can produce a clip on a single RTX 4090 in roughly 75 seconds.
The interesting bit
The SSTA (Selective and Sliding Tile Attention) mechanism prunes redundant spatiotemporal key-value blocks instead of brute-forcing attention over long sequences. The authors claim a 1.87× end-to-end speedup over FlashAttention-3 for 10-second 720p synthesis: attention efficiency as architecture, not afterthought.
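To make block-level pruning concrete, here is a minimal Python sketch of the general idea: score key-value blocks cheaply, keep the best few for each query block, and run ordinary attention over only those. This is not the authors' SSTA implementation. The mean-pooled scoring rule, the function names, and the `block_size` and `keep_blocks` parameters are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def attend(query, keys, values):
    """Plain scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    return [sum(w * val[d] for w, val in zip(weights, values))
            for d in range(len(values[0]))]


def block_pruned_attention(q, k, v, block_size, keep_blocks):
    """Attend each query block to only its top-scoring key-value blocks.

    Assumption: a block's relevance is the dot product of the mean-pooled
    query block and the mean-pooled key block (a hypothetical scoring rule).
    """
    key_blocks = [list(range(s, min(s + block_size, len(k))))
                  for s in range(0, len(k), block_size)]
    key_means = [mean_vector([k[i] for i in blk]) for blk in key_blocks]

    out = []
    for start in range(0, len(q), block_size):
        q_block = q[start:start + block_size]
        q_mean = mean_vector(q_block)
        scores = [dot(q_mean, km) for km in key_means]
        # Rank key blocks by score and keep the best `keep_blocks` of them.
        ranked = sorted(range(len(key_blocks)),
                        key=lambda b: scores[b], reverse=True)
        chosen = sorted(ranked[:keep_blocks])
        idx = [i for b in chosen for i in key_blocks[b]]
        kept_keys = [k[i] for i in idx]
        kept_values = [v[i] for i in idx]
        out.extend(attend(query, kept_keys, kept_values) for query in q_block)
    return out
```

With `keep_blocks` equal to the total number of key blocks, the function reduces to dense attention; lowering it skips key-value blocks entirely, which is where the savings on long video sequences would come from.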
Key highlights
- 8.3B DiT + 3D causal VAE with 16× spatial and 4× temporal compression
- Step-distilled I2V model: 8–12 steps, ~75s generation on RTX 4090, 75% time reduction vs. base
- Training code released: FSDP, context parallel, gradient checkpointing, plus the Muon optimizer they used
- Cache inference supported: DeepCache, TeaCache, TaylorCache for further speedups
- FP8 GEMM inference added December 2025
- Community ecosystem: ComfyUI, Diffusers, LightX2V, and low-VRAM forks (Wan2GP claims 6 GB)
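The compression figures in the first highlight translate into latent sizes with a little arithmetic. A rough sketch, assuming the common causal-VAE convention that the first frame is encoded on its own and the remaining frames in groups of four; the function name and the 121-frame clip length are hypothetical, not taken from the repo.

```python
def latent_shape(frames, height, width, spatial=16, temporal=4):
    """Return (latent_frames, latent_height, latent_width).

    Assumes the first frame maps to one latent frame by itself and every
    further group of `temporal` frames maps to one more (an assumption
    about the causal VAE, not confirmed by the source).
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    latent_frames = (frames - 1) // temporal + 1
    return latent_frames, height // spatial, width // spatial


# Hypothetical 121-frame clip at 1280x720.
t, h, w = latent_shape(frames=121, height=720, width=1280)
latent_positions = t * h * w
```

Under these assumptions a 720p clip shrinks to a 45 by 80 latent grid per latent frame, which is why attention cost rather than pixel count becomes the dominant concern.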
Caveats
- Officially Linux-only; Windows users are on their own or reliant on community ports
- Minimum 14 GB VRAM with model offloading, which is still hefty without distillation or caching tricks
- Some model weights (sparse attention, SR models) remain unreleased per the open-source plan
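For readers unfamiliar with the caching tricks mentioned above, the shared idea is to skip a model call when the input has barely changed between denoising steps. The sketch below shows a generic threshold-based reuse scheme. It is not DeepCache, TeaCache, or TaylorCache as implemented in the repo; the `StepCache` class, its relative-change metric, and the `threshold` parameter are assumptions for illustration.

```python
class StepCache:
    """Reuse the last model output while the input drifts only slightly."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None
        self.accumulated = 0.0  # drift accumulated since the last real call

    def __call__(self, model, x):
        if self.prev_input is not None:
            # Relative L1 change between this step's input and the last one.
            norm = sum(abs(b) for b in self.prev_input) or 1.0
            change = sum(abs(a - b) for a, b in zip(x, self.prev_input)) / norm
            self.accumulated += change
            if self.accumulated < self.threshold:
                self.prev_input = x
                return self.cached_output  # skip the expensive call
        # Drift is too large (or this is the first step): recompute.
        self.accumulated = 0.0
        self.prev_input = x
        self.cached_output = model(x)
        return self.cached_output
```

A higher `threshold` skips more steps at the cost of fidelity, so any such setting is a speed and quality trade-off rather than a free win.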
Verdict
Worth a look if you’re building video generation pipelines and need something that actually fits consumer hardware. Skip if you need guaranteed production stability or mature Windows support. This is still research-grade tooling with sharp edges.