Dance diffusion: when your GPU learns to boogie
A research implementation that generates editable 3D dance choreography from raw music using transformer diffusion and Jukebox features.

What it does
EDGE takes a music file (WAV) and generates plausible human dance motion as 3D joint positions. It uses a transformer-based diffusion model conditioned on Jukebox music features, and it supports targeted edits: joint-wise conditioning, which fixes the motion of chosen joints, and in-betweening, which fills gaps between existing poses. The output can be converted to FBX for Blender rendering.
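To illustrate what in-betweening asks of the model, here is a minimal sketch of masked blending, a common way for diffusion models to honor user constraints. It is an assumption about the general technique, not code from the repository; the function names and the plain-list data layout are hypothetical.

```python
def inbetween_mask(num_frames, num_values, keep_first, keep_last):
    """Mask that fixes the first and last few frames and frees the middle."""
    return [
        [1 if (t < keep_first or t >= num_frames - keep_last) else 0] * num_values
        for t in range(num_frames)
    ]

def blend_constraints(generated, known, mask):
    """Keep constrained entries from known, take the rest from generated.

    All three arguments are lists of frames; each frame is a list of values.
    mask[t][j] == 1 means entry j of frame t is fixed by the user.
    """
    return [
        [k if m else g for g, k, m in zip(g_row, k_row, m_row)]
        for g_row, k_row, m_row in zip(generated, known, mask)
    ]

# Five frames, two values per frame: fix the first and last frame only.
mask = inbetween_mask(5, 2, keep_first=1, keep_last=1)
generated = [[9.0, 9.0] for _ in range(5)]   # stand-in for model output
known = [[0.0, 0.0] for _ in range(5)]       # stand-in for user-supplied poses
result = blend_constraints(generated, known, mask)
print(result)
```

Joint-wise conditioning fits the same pattern: the mask would mark selected columns (joints) in every frame instead of selected frames.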
The interesting bit
The authors paired a diffusion model with Jukebox (not a lightweight music encoder, but the full 5-billion-parameter generative model), then added a custom metric called Physical Foot Contact (PFC) to quantify physically implausible foot sliding. The result was validated in a large-scale user study, which is rarer in generative motion work than you’d think.
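To make the foot-sliding idea concrete, here is a minimal sketch of a generic foot-skate measure. It is not the authors' PFC formula; the function name, the height threshold, the frame rate, and the z-up convention are all assumptions chosen for illustration. It averages the horizontal speed of a foot over the frames where that foot is close to the ground, so a planted foot that drifts scores higher than one that stays put.

```python
import math

def foot_skate(positions, fps=30.0, ground_height=0.05):
    """Mean horizontal speed (units/s) of a foot while it is near the ground.

    positions: list of (x, y, z) tuples, one per frame, z pointing up (assumed).
    ground_height: hypothetical threshold below which the foot counts as planted.
    """
    speeds = []
    for prev, curr in zip(positions, positions[1:]):
        # Only count frame pairs where the foot stays near the floor.
        if prev[2] <= ground_height and curr[2] <= ground_height:
            dx = curr[0] - prev[0]
            dy = curr[1] - prev[1]
            speeds.append(math.hypot(dx, dy) * fps)
    return sum(speeds) / len(speeds) if speeds else 0.0

planted = [(0.0, 0.0, 0.0)] * 5                      # foot stays put
sliding = [(0.1 * i, 0.0, 0.0) for i in range(5)]    # foot drifts along x
airborne = [(0.1 * i, 0.0, 1.0) for i in range(5)]   # moving, but off the floor
print(foot_skate(planted), foot_skate(sliding), foot_skate(airborne))
```

Motion through the air is ignored on purpose: only movement while the foot should be in contact counts as sliding.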
Key highlights
- Editable generation: joint-wise conditioning and in-betweening for fine control
- Outputs SMPL-format motion, convertible to FBX for Blender/Mixamo pipelines
- Includes PFC evaluation metric for physical plausibility
- Pre-trained checkpoint available; training on AIST++ takes ~6–24 hours with 1–8 high-end GPUs
- Optional feature caching to avoid re-extracting Jukebox representations on every run
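The caching idea in the last bullet can be sketched as follows. This is an illustrative pattern, not the repository's implementation: cached_features, the cache directory layout, and the extract callback (a stand-in for the expensive Jukebox step) are all hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_features(wav_path, cache_dir, extract):
    """Return features for wav_path, computing them at most once.

    extract is a hypothetical callable standing in for the slow feature
    extractor; it must return something JSON-serializable (e.g. nested lists).
    """
    wav_path = Path(wav_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the file's contents so renamed copies still hit.
    key = hashlib.sha256(wav_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    features = extract(wav_path)
    cache_file.write_text(json.dumps(features))
    return features

# Demo with a fake extractor that records how often it is called.
calls = []

def fake_extract(path):
    calls.append(path)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    wav = Path(tmp) / "song.wav"
    wav.write_bytes(b"not really audio")
    first = cached_features(wav, Path(tmp) / "cache", fake_extract)
    second = cached_features(wav, Path(tmp) / "cache", fake_extract)

print(len(calls))  # the second call is served from the cache
```

Hashing file contents rather than file names is a design choice made here for the sketch; a name-based key would be cheaper but would miss renamed files.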
Caveats
- Windows is “not officially supported”; validated only on Debian 10 with NVIDIA T4
- Jukebox feature extraction is memory-hungry and slow; preprocessing the full dataset takes ~24 hours and ~50 GB of disk space
- The authors explicitly state this is a research implementation that “will not be regularly updated or maintained long after release”
- File names with spaces or parentheses in --music_dir cause “unpredictable behavior”
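One way to sidestep the file-name caveat is to rename tracks to safe names before pointing --music_dir at them. The helpers below are a hypothetical pre-processing step, not part of the repository. Which characters are actually problematic beyond spaces and parentheses is an assumption, so the sketch conservatively keeps only letters, digits, dots, hyphens, and underscores.

```python
import re
from pathlib import Path

def safe_name(filename):
    """Replace runs of characters outside [A-Za-z0-9._-] in the stem with '_'."""
    path = Path(filename)
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", path.stem).strip("_")
    # Fall back to a placeholder if nothing usable is left.
    return (stem or "track") + path.suffix

def rename_unsafe(music_dir):
    """Rename files in music_dir in place; returns {old_name: new_name}."""
    renamed = {}
    for path in sorted(Path(music_dir).iterdir()):
        new = safe_name(path.name)
        if path.is_file() and new != path.name:
            target = path.with_name(new)
            if target.exists():
                raise FileExistsError(target)  # never overwrite silently
            path.rename(target)
            renamed[path.name] = new
    return renamed

print(safe_name("My Song (remix).wav"))
```

Running rename_unsafe on a copy of the music folder, rather than the originals, keeps the change easy to undo.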
Verdict
Worth a look if you’re doing generative motion research or need a baseline for music-conditioned dance generation. Skip if you want production-ready tooling or lack the GPU memory (16 GB minimum) and patience for Jukebox preprocessing.