Is Step-Audio2 open source?

Yes — stepfun-ai/Step-Audio2 is open source, released under the Apache-2.0 license.

What language is Step-Audio2 written in?

stepfun-ai/Step-Audio2 is primarily written in Python.

How popular is Step-Audio2?

stepfun-ai/Step-Audio2 has 1.5k stars on GitHub.

Where can I find Step-Audio2?

stepfun-ai/Step-Audio2 is on GitHub at https://github.com/stepfun-ai/Step-Audio2.

← all repositories

stepfun-ai/Step-Audio2

A voice LLM that hears emotion, not just words

Step-Audio 2 open-sources mini variants of an audio-native LLM that claims to beat commercial ASR models while reasoning about how you speak, not just what you say.

★1.5k stars Python Image · Video · Audio Language Models

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does This end-to-end model handles speech recognition, audio understanding, and voice conversation in a single pass. It attempts to infer paralinguistic details—age, emotion, tone—alongside semantic content, and can respond naturally in spoken dialogue. The open-source release includes three “mini” checkpoints—Step-Audio 2 mini, mini Base, and mini Think—under an Apache 2.0 license.

The interesting bit Rather than treating voice as a text pipeline, the model reasons about acoustic context directly and can retrieve real-world knowledge via multimodal RAG, including switching its output timbre based on retrieved speech samples. The project also publishes its own evaluation benchmarks for paralinguistic understanding and audio tool calling, suggesting it wants to define the rules it plays by.

Key highlights

Benchmark tables in the README claim top ASR results on English and Chinese test sets against GPT-4o Transcribe, Kimi-Audio, and Qwen-Omni
Supports tool calling and RAG using both textual and acoustic knowledge, with timbre switching based on retrieved audio
Ships with a vLLM backend for streaming inference and multi-GPU deployment
Includes dedicated evaluation benchmarks for paralinguistic understanding and audio tool calling

Caveats

The open weights are “mini” variants; the full model shown in many benchmark leaderboards is not released
The README shows a trust-remote-code flag for vLLM serving, which complicates auditing
Several sections remain skeletal, including a TODO comment in the news timeline and sparse architectural detail

Verdict Worth a look if you need an open, Apache-licensed speech model with built-in emotion and tool reasoning, but wait for more documentation if you need to understand the training stack or audit the full architecture.

Frequently asked

What is stepfun-ai/Step-Audio2?: Step-Audio 2 open-sources mini variants of an audio-native LLM that claims to beat commercial ASR models while reasoning about how you speak, not just what you say.
Is Step-Audio2 open source?: Yes — stepfun-ai/Step-Audio2 is open source, released under the Apache-2.0 license.
What language is Step-Audio2 written in?: stepfun-ai/Step-Audio2 is primarily written in Python.
How popular is Step-Audio2?: stepfun-ai/Step-Audio2 has 1.5k stars on GitHub.
Where can I find Step-Audio2?: stepfun-ai/Step-Audio2 is on GitHub at https://github.com/stepfun-ai/Step-Audio2.