CyberAgentAILab/TANGO
A diffusion model that synthesizes realistic gesture videos from speech audio through hierarchical audio-motion embedding.

Velocity (7d): +2.0 ★/day · Trend: steady
TANGO generates co-speech gesture videos by mapping audio features to body motion through hierarchical audio-motion embedding and diffusion interpolation. Given speech input, it produces matching gesture animation, which enables video reenactment with body language synchronized to the audio.