bytedance/LatentSync
End-to-end lip-sync method using audio-conditioned latent diffusion models built on Stable Diffusion.

LatentSync synchronizes lip movements in a video to a given audio track. It uses Whisper to convert the audio into embeddings, which are integrated into a U-Net via cross-attention, and generates frames in Stable Diffusion’s latent space. The U-Net input is the noised latents concatenated with the reference and masked frames. During training, a one-step method estimates the clean latents from the predicted noise. It supports both Chinese and English video content, and recent versions improve temporal consistency.
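
The one-step estimate mentioned above can be illustrated with the standard DDPM relation x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t). The sketch below is a minimal pure-Python illustration of that formula, not LatentSync's actual training code: the function names `add_noise` and `estimate_clean_latent`, the flat 8-value "latent", and the value `alpha_bar_t = 0.5` are all assumptions made for the example.

```python
import math
import random


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def estimate_clean_latent(x_t, eps_pred, alpha_bar_t):
    """One-step estimate: x0_hat = (x_t - sqrt(1 - a) * eps_pred) / sqrt(a)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]


# Toy data: a flat list stands in for a latent tensor (hypothetical values).
random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(8)]
eps = [random.gauss(0.0, 1.0) for _ in range(8)]
alpha_bar_t = 0.5  # assumed cumulative noise-schedule value at step t

x_t = add_noise(x0, eps, alpha_bar_t)

# Stand-in for the U-Net: pretend it predicted the noise perfectly.
x0_hat = estimate_clean_latent(x_t, eps, alpha_bar_t)

# With a perfect noise prediction the estimate matches x0 up to rounding.
max_err = max(abs(u - v) for u, v in zip(x0, x0_hat))
print(max_err)
```

In real training the noise prediction is imperfect, so the estimate only approximates the clean latents. The appeal of estimating them in a single step is that losses can be computed against clean targets without running the full iterative denoising loop.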