HumanAIGC/EMO
An audio-driven portrait video generation system that uses a diffusion model to create expressive talking-head videos.

EMO is a diffusion-based approach that synthesizes expressive portrait videos directly from audio input, without requiring explicit 3D representations or intermediate landmarks. The model generates talking-head videos with natural facial expressions, head movements, and synchronized lip motion by learning audio-visual correspondences through a weak-conditioning framework. It was published at ECCV 2024 by researchers from Alibaba Group’s Institute for Intelligent Computing.