Wan-Video/Wan2.2

Alibaba's open video model grows up, adds experts

Wan2.2 brings Mixture-of-Experts to video diffusion, plus a 5B model that runs 720P@24fps on a consumer GPU.

★ 16.1k stars · Python · Image · Video · Audio · Inference · Serving


Velocity (7d): +51 ★ / day · Trend: steady

What it does

Wan2.2 is a family of open-source video generation models from Alibaba’s Wan team. It handles text-to-video, image-to-video, speech-to-video, and character animation. The flagship 14B models use a Mixture-of-Experts architecture; a smaller 5B model targets 720P at 24fps on hardware like an RTX 4090.

The interesting bit

The MoE design splits the denoising process across timesteps, handing different stages to specialized “expert” sub-models. The README claims this expands model capacity without increasing compute cost, a trick more common in LLMs than in diffusion models. The TI2V-5B model also ships with a VAE that compresses video by 16×16×4, which is how it squeezes high-res generation into consumer VRAM.
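To make the timestep-split idea concrete, here is a minimal sketch of routing each denoising step to exactly one expert. It is not Wan2.2's code: the two-expert split, the `boundary` value, the class name, and the toy expert functions are all hypothetical, chosen only to show why per-step compute stays flat while total parameters grow.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
Expert = Callable[[Latent, float], Latent]


def coarse_expert(latent: Latent, t: float) -> Latent:
    """Toy stand-in for an expert that handles noisy, early steps."""
    return [x * 0.5 for x in latent]


def detail_expert(latent: Latent, t: float) -> Latent:
    """Toy stand-in for an expert that handles cleaner, late steps."""
    return [x * 0.9 for x in latent]


@dataclass
class TimestepRoutedDenoiser:
    high_noise_expert: Expert
    low_noise_expert: Expert
    boundary: float  # hypothetical switch point; t runs from 1.0 (noise) to 0.0

    def pick(self, t: float) -> Expert:
        # Exactly one expert is chosen per step, so the cost of a step
        # matches a single model even though two sets of weights exist.
        return self.high_noise_expert if t >= self.boundary else self.low_noise_expert

    def denoise(self, latent: Latent, timesteps: Sequence[float]) -> Tuple[Latent, List[str]]:
        used: List[str] = []
        for t in timesteps:
            expert = self.pick(t)
            latent = expert(latent, t)
            used.append(expert.__name__)
        return latent, used


denoiser = TimestepRoutedDenoiser(coarse_expert, detail_expert, boundary=0.5)
result, used = denoiser.denoise([8.0], [1.0, 0.75, 0.5, 0.25])
print(used)
```

With these toy settings the first three steps go to `coarse_expert` and the last to `detail_expert`. Where the real models place the switch, and how many experts they use, is something to check in the repository rather than infer from this sketch.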

Key highlights

Five model variants: T2V-A14B, I2V-A14B, TI2V-5B, S2V-14B (speech-driven), and Animate-14B (character animation)
14B models need ~80GB VRAM for single-GPU inference; the 5B model is the consumer-friendly option
Integrations already shipped for ComfyUI, Diffusers, and community tools like LightX2V and FastVideo
Training data grew by 65.6% for images and 83.2% for videos versus Wan2.1
Apache 2.0 weights hosted on both Hugging Face and ModelScope
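As a rough illustration of why the 5B model is the consumer-friendly option, the 16×16×4 VAE compression mentioned above can be turned into arithmetic. The sketch below assumes 720P means 1280×720, uses a hypothetical 5-second clip (120 frames at 24fps), and uses plain ceiling division for each axis; the actual VAE may treat the first frame or padding differently, and channel counts are ignored.

```python
import math
from typing import Tuple


def latent_shape(frames: int, height: int, width: int,
                 t_ratio: int = 4, s_ratio: int = 16) -> Tuple[int, int, int]:
    """Latent grid (frames, height, width), assuming simple ceil division."""
    return (
        math.ceil(frames / t_ratio),
        math.ceil(height / s_ratio),
        math.ceil(width / s_ratio),
    )


def position_reduction(frames: int, height: int, width: int) -> float:
    """How many pixel positions map onto one latent position."""
    lt, lh, lw = latent_shape(frames, height, width)
    return (frames * height * width) / (lt * lh * lw)


# Hypothetical clip: 5 seconds at 24fps, 1280x720.
shape = latent_shape(120, 720, 1280)
ratio = position_reduction(120, 720, 1280)
print(shape, ratio)  # (30, 45, 80) 1024.0
```

Under these assumptions the diffusion model works on a 30×45×80 grid instead of 120×720×1280, about a thousandfold fewer spatiotemporal positions.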

Caveats

The “TOP performance among all open-sourced and closed-sourced models” claim is stated but not substantiated with benchmarks in the README
80GB VRAM for the 14B models puts serious generation out of reach for most individual developers
The speech-to-video path requires an additional CosyVoice dependency

Verdict

Worth a look if you’re building video generation pipelines and need open weights with broad modality coverage. Skip if you’re hoping to train from scratch or run top-tier models on modest hardware: the 5B model is usable, but the 14B variants are firmly workstation-or-cloud territory.