← all repositories
Tencent-Hunyuan/HunyuanVideo

Tencent's 13B-parameter open video model goes toe-to-toe with closed rivals

A research team bet that scale and a repurposed MLLM text encoder could close the gap between open-source and proprietary video generation.

HunyuanVideo
Velocity · 7d
+22
★ / day
Trend
steady
star history

What it does HunyuanVideo is a text-to-video diffusion transformer with over 13 billion parameters—the largest open-source video model at release. It generates 720p video from text prompts using PyTorch inference code and officially released weights. The repo includes single-GPU scripts, multi-GPU parallel inference via xDiT, FP8 quantized weights for memory relief, and a Gradio web demo.

The interesting bit Instead of the usual CLIP-plus-T5 text encoder combo, the team uses a decoder-only Multimodal Large Language Model (MLLM) fine-tuned on visual instructions. They claim this yields better image-text alignment in feature space, which helps the diffusion model actually follow complex prompts. The architecture also uses a “dual-stream to single-stream” transformer design: video and text tokens start in separate lanes, then merge for multimodal fusion.

Key highlights

  • 13B parameters trained with image-video joint modeling and a causal 3D VAE for spatial-temporal compression
  • MLLM text encoder with decoder-only structure, not the typical CLIP/T5-XXL stack
  • FP8 weights and multi-GPU sequence-parallel inference available for faster or leaner runs
  • Integrated into HuggingFace Diffusers and ComfyUI; extensive community forks for GGUF, CPU-offloading, and acceleration
  • Human evaluation claims it outperforms Runway Gen-3, Luma 1.6, and top Chinese closed models—though the README doesn’t detail evaluator methodology

Caveats

  • The README is vague on exact VRAM requirements for full-precision inference; FP8 and community “GPU Poor” forks exist for a reason
  • Human evaluation results are asserted without reproducible benchmark detail in the provided text
  • Active development has moved forward: HunyuanVideo-1.5, I2V, avatar, and custom variants now live in separate repos

Verdict Worth a look if you’re building video generation pipelines and want a fully open weights baseline with broad ecosystem support. Skip if you need turnkey consumer hardware support without quantization hacks or if you require rigorous, reproducible benchmark documentation before trusting performance claims.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.