Open video generation that fits on a single GPU
Alibaba's Wan2.1 ships 14B and 1.3B parameter models for text-to-video, image-to-video, and editing—claiming SOTA results without the SOTA hardware barrier.

What it does
Wan2.1 is a suite of open video foundation models from Alibaba’s Wan team. It generates video from text prompts, still images, first-and-last frames, or editing instructions, and also handles text-to-image and video-to-audio. The repo provides inference code, model weights, and integrations for ComfyUI, Diffusers, and Gradio.
The interesting bit
The 1.3B text-to-video model needs about 8.2 GB of VRAM, which puts it within reach of most consumer GPUs rather than only a 24 GB card like the RTX 4090. On an RTX 4090 it generates a 5-second 480p clip in about 4 minutes without quantization or similar optimization tricks. The project also claims to be the first video model that generates readable Chinese and English text inside the video itself, which is rarer than it sounds.
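To put those figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It only restates the numbers quoted above (about 8.2 GB of VRAM, about 4 minutes per 5-second 480p clip). The helper names `fits_in_vram` and `estimate`, the 1 GB headroom, the linear scaling with clip length, and the `speedup` factor are illustrative assumptions, not part of Wan2.1.

```python
from dataclasses import dataclass

# Figures quoted in the write-up; treat them as approximate.
REQUIRED_VRAM_GB = 8.2        # 1.3B text-to-video model
REFERENCE_CLIP_SECONDS = 5.0  # length of the reference 480p clip
REFERENCE_WALL_MINUTES = 4.0  # reported generation time for that clip


@dataclass(frozen=True)
class GenerationEstimate:
    clip_seconds: float  # length of the generated clip
    wall_seconds: float  # estimated time spent generating it

    @property
    def slowdown(self) -> float:
        """Wall-clock seconds needed per second of output video."""
        return self.wall_seconds / self.clip_seconds


def fits_in_vram(available_gb: float,
                 required_gb: float = REQUIRED_VRAM_GB,
                 headroom_gb: float = 1.0) -> bool:
    """Rough check: does the card have the required VRAM plus some headroom?

    The 1 GB headroom is an assumption to cover the desktop, drivers, etc.
    """
    return available_gb >= required_gb + headroom_gb


def estimate(clip_seconds: float, speedup: float = 1.0) -> GenerationEstimate:
    """Scale the reference timing linearly with clip length (a crude assumption).

    `speedup` models an acceleration factor such as a claimed ~2x.
    """
    if clip_seconds <= 0 or speedup <= 0:
        raise ValueError("clip_seconds and speedup must be positive")
    reference_wall_seconds = REFERENCE_WALL_MINUTES * 60.0
    wall = clip_seconds / REFERENCE_CLIP_SECONDS * reference_wall_seconds / speedup
    return GenerationEstimate(clip_seconds=clip_seconds, wall_seconds=wall)


if __name__ == "__main__":
    base = estimate(5.0)
    print(f"baseline: {base.wall_seconds:.0f} s, {base.slowdown:.0f}x slower than real time")
    print("fits on a 12 GB card:", fits_in_vram(12.0))
```

Under these assumptions the baseline works out to roughly 48 seconds of compute per second of video, so this is a batch workflow rather than an interactive one, even on high-end consumer hardware.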
Key highlights
- 14B and 1.3B parameter variants for T2V, I2V, first-last-frame-to-video, and VACE (all-in-one editing)
- Wan-VAE encodes/decodes 1080p video of arbitrary length while preserving temporal information
- ComfyUI and Diffusers integrations shipped; Gradio demos included
- Active ecosystem: community projects include motion control (Wan-Move), virtual try-on (MagicTryOn), autonomous driving world models (DriVerse), and acceleration frameworks (TeaCache claims ~2x speedup)
- Weights hosted on both Hugging Face and ModelScope
Caveats
- The 1.3B model at 720p is described as “less stable” than at 480p due to limited training at that resolution
- First-last-frame-to-video is trained primarily on Chinese text-video pairs, so Chinese prompts are recommended for better results
- Multi-GPU inference through Diffusers is still unchecked for several tasks on the project's to-do list
Verdict
Worth a look if you want open video generation with a consumer GPU option and a growing tooling ecosystem. Skip if you need guaranteed production reliability or mature multi-GPU Diffusers support today.