ace-step/ACE-Step
Open-source foundation model for music generation using diffusion and deep compression autoencoders.

ACE-Step is a foundation model for music generation that combines diffusion-based generation with Sana’s Deep Compression AutoEncoder and a lightweight linear transformer. During training, it aligns its semantic representations with MERT and m-hubert, which speeds up convergence. The model can synthesize up to 4 minutes of music in approximately 20 seconds on an A100 GPU. This is significantly faster than LLM-based baselines, and the model is reported to achieve better musical coherence and controllability than those baselines.
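To put the throughput figure above in perspective, the short sketch below works out the implied real-time factor, that is, seconds of audio produced per second of compute. It is only back-of-the-envelope arithmetic on the two numbers quoted above (4 minutes of audio, roughly 20 seconds on an A100); the helper name `real_time_factor` is hypothetical and is not part of the ACE-Step codebase.

```python
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Return seconds of audio generated per second of wall-clock compute.

    Hypothetical helper for illustration only; not an ACE-Step API.
    """
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


# Figures quoted in the description: up to 4 minutes of music
# in approximately 20 seconds on an A100 GPU.
audio_seconds = 4 * 60
generation_seconds = 20

rtf = real_time_factor(audio_seconds, generation_seconds)
print(f"{rtf:.0f}x faster than real time")  # prints "12x faster than real time"
```

Because the 20-second figure is approximate and the 4-minute figure is an upper bound, the resulting factor of about 12 should be read as a rough estimate for that hardware, not a benchmark result.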