stepfun-ai/Step-Audio-EditX
A 3B-parameter LLM-based audio editing model, trained with reinforcement learning, that controls emotion, speaking style, and paralinguistics.

Velocity (7d): +4.2 ★ / day
Trend: steady
Step-Audio-EditX is a large language model for audio editing and synthesis. It edits emotion, speaking style, and paralinguistic features in existing audio, and also supports zero-shot text-to-speech. Training combines supervised fine-tuning (SFT) with preference-optimization and reinforcement learning methods (DPO and GRPO). The model works across languages, including English, Japanese, and Korean, and can be served with vLLM for efficient inference.