Dub any video forever, or make a person talk from a single photo
InfiniteTalk generates unlimited-length lip-synced talking video from either an existing video or a single image, using audio to drive head pose, expression, and body movement.

What it does
InfiniteTalk is an audio-driven video generation system built on top of Wan2.1-I2V-14B. Feed it a video plus an audio track and it re-synthesizes the subject with matched lip sync, head movement, body posture, and facial expressions. Feed it a single image plus audio and it generates a talking video from scratch. The “infinite-length” claim refers to streaming generation: output length isn’t hard-capped at a few seconds.
The interesting bit
Most dubbing tools fixate on lips and call it done. InfiniteTalk attempts to sync the whole body to speech rhythm, which is the harder and more noticeable problem. The trade-off is familiar from long-video generation: color drift and identity degradation accumulate over time, and the authors openly note that camera movement matching is approximate unless you accept more drift.
Key highlights
- Video-to-video and image-to-video modes; 480P and 720P output
- Built on Wan2.1-I2V-14B with custom audio conditioning weights
- TeaCache and int8 quantization supported for lower VRAM; multi-GPU inference available
- Gradio demo and ComfyUI branch provided
- Community integrations: Wan2GP (low-VRAM optimization) and kijai’s ComfyUI wrapper
Caveats
- Color shifts worsen after roughly 1 minute in image-to-video mode; the repo suggests a workaround (translate/zoom the static image into a short video) rather than fixing it
- Camera movement in video-to-video mode is mimicked, not reproduced; SDEdit improves accuracy but introduces its own color shift
- FusionX LoRA speeds things up but also degrades identity preservation over long clips
- Inference acceleration (LCM distillation, sparse attention) is still on the todo list
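
The image-to-video workaround in the first caveat (turning the static image into a short clip by translating or zooming it) can be pictured with a small sketch. This is not InfiniteTalk code: the function name `zoom_crop_boxes` and the `max_zoom` parameter are hypothetical, and the sketch only computes the crop rectangle for each frame of a slow centered zoom-in. Each rectangle would then be cropped and resized back to full resolution with whatever image library you prefer.

```python
def zoom_crop_boxes(width, height, num_frames, max_zoom=1.1):
    """Return one (left, top, right, bottom) crop box per frame.

    Hypothetical helper, not part of InfiniteTalk: it describes a slow,
    centered zoom-in over a static image of size width x height.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if max_zoom < 1.0:
        raise ValueError("max_zoom must be >= 1.0")

    boxes = []
    for i in range(num_frames):
        # Interpolate the zoom factor linearly from 1.0 to max_zoom.
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        zoom = 1.0 + t * (max_zoom - 1.0)

        # A larger zoom factor means a smaller crop window.
        crop_w = round(width / zoom)
        crop_h = round(height / zoom)

        # Keep the crop window centered in the source image.
        left = (width - crop_w) // 2
        top = (height - crop_h) // 2
        boxes.append((left, top, left + crop_w, top + crop_h))
    return boxes


boxes = zoom_crop_boxes(1280, 720, num_frames=5, max_zoom=1.25)
```

Whether a gentle zoom like this is enough to suppress the color shift is something to check against the repo's own guidance; the sketch only shows how a still image can be given motion before it is passed in as video.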
Verdict
Worth a look if you need long-form talking-head generation and can tolerate manually tuning CFG scales and applying workarounds for drift. If you need broadcast-perfect lip sync with locked camera motion out of the box, this isn’t there yet, though the authors are admirably upfront about where the seams show.