Tencent-Hunyuan/HunyuanVideo-1.5

8.3B parameters, 720p video, one RTX 4090

Tencent's open video generator shrinks the hardware barrier without shrinking ambition.

★ 4.5k stars · Python · Image · Video · Audio


Velocity (7d): +22 ★ / day · Trend: steady

What it does

HunyuanVideo-1.5 handles text-to-video and image-to-video generation at 480p and 720p natively, with a super-resolution stage that upscales to 1080p. It is built around an 8.3B-parameter diffusion transformer and a 3D causal VAE. The repo ships inference code, model weights, training scripts, and integrations for ComfyUI, Diffusers, and LightX2V. A step-distilled image-to-video variant can crank out a video on a single RTX 4090 in roughly 75 seconds.

The interesting bit

The SSTA (Selective and Sliding Tile Attention) mechanism prunes redundant spatiotemporal key-value blocks instead of brute-forcing attention over long sequences. The authors claim a 1.87× end-to-end speedup over FlashAttention-3 for 10-second 720p synthesis—attention efficiency as architecture, not afterthought.
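
To make "pruning key-value blocks" concrete, here is a toy NumPy sketch of generic top-k block-sparse attention. It is not the SSTA algorithm from the repo: the mean-pooled block scoring rule, the function names, and the `block` and `keep` parameters are all assumptions chosen for illustration. The only idea it shares with the description above is that each query block attends to a subset of key-value blocks rather than the full sequence.

```python
import numpy as np

def dense_attention(q, k, v):
    """Plain softmax attention over every key (the brute-force baseline)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Toy block pruning (hypothetical, not the repo's SSTA kernel).

    Each query block scores every key block using mean-pooled vectors,
    keeps the `keep` highest-scoring key blocks, and runs softmax
    attention over those blocks only.
    """
    n, d = q.shape
    assert n % block == 0 and k.shape[0] % block == 0
    nq, nk = n // block, k.shape[0] // block
    keep = min(keep, nk)

    # Cheap block-level summaries used only to decide what to keep.
    q_pool = q.reshape(nq, block, d).mean(axis=1)
    k_pool = k.reshape(nk, block, d).mean(axis=1)
    block_scores = q_pool @ k_pool.T  # shape (nq, nk)

    out = np.empty((n, v.shape[1]))
    for i in range(nq):
        # Indices of the key blocks this query block will attend to.
        top = np.sort(np.argsort(block_scores[i])[-keep:])
        idx = (top[:, None] * block + np.arange(block)).ravel()
        rows = slice(i * block, (i + 1) * block)
        out[rows] = dense_attention(q[rows], k[idx], v[idx])
    return out

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
sparse_out = block_sparse_attention(q, k, v, block=4, keep=2)
```

With `keep` equal to the number of key blocks the function reduces to dense attention; lowering it trades accuracy for less work, which is the general trade-off any block-pruning scheme has to manage.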

Key highlights

8.3B DiT + 3D causal VAE with 16× spatial and 4× temporal compression
Step-distilled I2V model: 8–12 steps, ~75s generation on RTX 4090, about 75% less end-to-end time than the base model
Training code released: FSDP, context parallel, gradient checkpointing, plus the Muon optimizer they used
Cache inference supported: DeepCache, TeaCache, TaylorCache for further speedups
FP8 GEMM inference added December 2025
Community ecosystem: ComfyUI, Diffusers, LightX2V, and low-VRAM forks (Wan2GP claims 6 GB)
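
For a sense of what the VAE compression figures mean in practice, the sketch below works out a latent grid size from the 16× spatial and 4× temporal factors. Two things here are assumptions rather than facts from the repo: the helper name `latent_shape`, and the temporal rule `(frames - 1) // 4 + 1`, which is a common convention for causal video VAEs that encode the first frame on its own. The 121-frame clip length is likewise just an example input.

```python
def latent_shape(frames, height, width, spatial=16, temporal=4):
    """Estimate the latent grid (frames, height, width) after VAE encoding.

    Assumes a causal VAE where the first frame is encoded alone and every
    following group of `temporal` frames maps to one latent frame.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be divisible by the spatial factor")
    latent_frames = (frames - 1) // temporal + 1
    return latent_frames, height // spatial, width // spatial

# Example input: a 121-frame 720p clip (1280x720).
t, h, w = latent_shape(121, 720, 1280)
positions = t * h * w
print(t, h, w, positions)  # 31 45 80 111600
```

Under those assumptions a 720p clip shrinks from 921,600 pixels per frame to a 45×80 latent grid, which goes a long way toward explaining how an 8.3B model fits on a consumer card.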

Caveats

Officially Linux-only; Windows users are on their own or have to rely on community ports
Minimum 14 GB VRAM with model offloading—still hefty without distillation or caching tricks
Some model weights (sparse attention, SR models) remain unreleased per the open-source plan

Verdict

Worth a look if you’re building video generation pipelines and need something that actually fits consumer hardware. Skip if you need guaranteed production stability or mature Windows support—this is still research-grade tooling with sharp edges.