researchmm/MM-Diffusion
A PyTorch implementation of a diffusion model that generates aligned audio-video pairs using a sequential multi-modal U-Net with separate audio and video subnets.

This repository implements MM-Diffusion, a framework for joint audio and video generation that was accepted at CVPR 2023. It uses a sequential multi-modal U-Net in which two subnets, one for audio and one for video, learn to generate aligned audio-video pairs from Gaussian noise. The model supports conditional generation and was trained on datasets including Landscape, AIST++, and AudioSet.
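To make "generating aligned pairs from Gaussian noise" more concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to an audio/video pair. It is not the repository's code: the function names (`linear_alpha_bar`, `add_noise`, `noise_pair`), the linear beta schedule and its endpoints, and the choice to noise both modalities at the same timestep with independent noise are assumptions made for illustration. It uses plain Python lists instead of PyTorch tensors to stay dependency-free.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are common DDPM defaults, assumed here.
    """
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t from q(x_t | x_0) for a flat list of floats.

    Returns the noised sample and the noise that was added.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_pair(audio0, video0, t, alpha_bar, rng):
    """Noise an audio/video pair at the same timestep t (hypothetical helper).

    Each modality gets its own independent Gaussian noise.
    """
    audio_t, eps_a = add_noise(audio0, alpha_bar[t], rng)
    video_t, eps_v = add_noise(video0, alpha_bar[t], rng)
    return (audio_t, eps_a), (video_t, eps_v)


rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)
audio0 = [0.5] * 16     # stand-in for a flattened waveform clip
video0 = [-0.25] * 48   # stand-in for flattened video frames
(audio_t, eps_a), (video_t, eps_v) = noise_pair(audio0, video0, 999, alpha_bar, rng)
```

In a full model, a denoising network would take both noised inputs together and predict the noise for each modality, and sampling would run this process in reverse starting from pure Gaussian noise. That network is where the two coupled subnets described above would come in; it is omitted from this sketch.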