jdh-algo/JoyHallo
JoyHallo is a deep learning model that generates talking face videos from audio input, specifically optimized for Mandarin speech.

JoyHallo is an audio-driven video generation model for creating Mandarin talking head videos. It uses a Chinese wav2vec2 model to extract audio features and employs a semi-decoupled structure to capture the relationships among lip, expression, and pose features. The model was trained on 29 hours of Mandarin speech video collected from JD Health employees, covering diverse speaking styles and medical topics.
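
To make the audio-feature step more concrete, here is a minimal sketch of how wav2vec2-style features could be lined up with video frames. It is an illustration under assumptions, not JoyHallo's actual code: it assumes the standard wav2vec2 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2 on 16 kHz audio), assumes a 25 fps target video, and uses plain linear interpolation for the alignment. The function names `wav2vec2_num_frames` and `resample_features` are hypothetical.

```python
# Assumption: standard wav2vec2 feature-encoder layout; the Chinese wav2vec2
# checkpoint used by JoyHallo may differ.
WAV2VEC2_KERNELS = (10, 3, 3, 3, 3, 2, 2)
WAV2VEC2_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def wav2vec2_num_frames(num_samples):
    """Number of feature frames the conv encoder emits for raw audio samples."""
    length = num_samples
    for kernel, stride in zip(WAV2VEC2_KERNELS, WAV2VEC2_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def resample_features(features, num_video_frames):
    """Linearly interpolate a sequence of feature vectors to num_video_frames.

    Hypothetical alignment step: one feature vector per video frame.
    """
    n = len(features)
    if n == 0 or num_video_frames <= 0:
        raise ValueError("need at least one feature and one video frame")
    if n == 1 or num_video_frames == 1:
        return [list(features[0]) for _ in range(num_video_frames)]
    out = []
    for i in range(num_video_frames):
        # Fractional position of video frame i on the audio-feature axis.
        pos = i * (n - 1) / (num_video_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out


if __name__ == "__main__":
    sample_rate = 16000  # assumed input rate for wav2vec2
    fps = 25             # assumed video frame rate
    seconds = 2
    n_audio = wav2vec2_num_frames(sample_rate * seconds)
    # Dummy 1-D "features" standing in for real wav2vec2 hidden states.
    dummy = [[float(t)] for t in range(n_audio)]
    aligned = resample_features(dummy, fps * seconds)
    print(n_audio, len(aligned))
```

Under these assumptions the encoder emits roughly 50 feature frames per second, about two per video frame at 25 fps, so some resampling or windowing is needed before the audio features can condition each generated frame.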