Fantasy-AMAP/fantasy-talking
A diffusion-transformer system that generates realistic talking portrait videos from audio input by synthesizing coherent facial motion.

Velocity (7d): +3.7 ★/day · Trend: steady
FantasyTalking generates photorealistic talking-head videos driven by audio. It uses a diffusion transformer (Wan2.1) as the base generative model and Wav2Vec as the audio encoder. The system synthesizes coherent facial motion, including lip movements, expressions, and head poses, to produce natural-looking talking portraits. Published at ACM MM 2025.
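To make the audio-conditioning idea concrete, here is a minimal sketch of how a stream of audio features could be lined up with video frames. It is not taken from the FantasyTalking code. It assumes the Wav2Vec 2.0 convolutional front end (seven layers with a total stride of 320 samples at 16 kHz), and the helper names, the windowing rule, and the 25 fps value in the usage note are hypothetical choices for illustration only.

```python
# Sketch only: assumes the Wav2Vec 2.0 convolutional feature encoder layout.
# (kernel, stride) for each of its seven 1-D convolution layers.
WAV2VEC2_CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
TOTAL_STRIDE = 320  # product of the strides above: 5 * 2**6 samples per feature


def num_audio_features(num_samples: int) -> int:
    """Number of feature vectors the encoder emits for a raw waveform."""
    length = num_samples
    for kernel, stride in WAV2VEC2_CONV_LAYERS:
        # Standard output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return max(length, 0)


def audio_window_for_frame(frame_idx: int, fps: int, num_features: int,
                           sample_rate: int = 16000) -> tuple[int, int]:
    """Hypothetical alignment rule: half-open range [start, end) of audio
    feature indices that overlap one video frame."""
    # Integer arithmetic avoids floating-point rounding at frame boundaries.
    start = (frame_idx * sample_rate) // (TOTAL_STRIDE * fps)
    end = ((frame_idx + 1) * sample_rate) // (TOTAL_STRIDE * fps)
    end = max(end, start + 1)  # every frame gets at least one feature
    # Clamp to the features that actually exist.
    return min(start, num_features - 1), min(end, num_features)


if __name__ == "__main__":
    n = num_audio_features(16000)  # one second of 16 kHz audio
    print(n, audio_window_for_frame(0, 25, n), audio_window_for_frame(24, 25, n))
```

Under these assumptions, one second of 16 kHz audio yields 49 feature vectors, so at an illustrative 25 fps each video frame overlaps roughly two of them. How a real system pools or attends over such a window is a separate design choice that this sketch does not cover.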