
opendilab/CleanS2S

One-file voice bot that talks, listens, and interrupts

CleanS2S packs a real-time, interruptible speech-to-speech pipeline into a single Python file so researchers can prototype voice agents without drowning in configuration.

★534 stars Python Agents Image · Video · Audio

What it does

CleanS2S is a standalone Python prototype that listens to you, reasons through an LLM, and talks back in real time over a WebSocket. It chains ASR, LLM, and TTS into one continuous streaming loop, aiming for the fluid, full-duplex voice chat that GPT-4o popularized. The entire pipeline lives in a single file, meant to be read and hacked on rather than deployed as a finished product.
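The chained, streaming shape described above can be illustrated with plain Python generators. This is a minimal sketch, not code from CleanS2S: `fake_asr`, `fake_llm`, `fake_tts`, and `run_pipeline` are hypothetical stand-ins that only show how each stage can consume the previous stage's output as it arrives.

```python
from typing import Iterable, Iterator, List


def fake_asr(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical ASR stage: pretend each audio chunk decodes to one word."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")


def fake_llm(words: Iterable[str]) -> Iterator[str]:
    """Hypothetical LLM stage: collect the utterance, then stream reply tokens."""
    prompt = " ".join(words)
    for token in f"You said: {prompt}".split():
        yield token


def fake_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical TTS stage: emit one 'audio' frame per token as it arrives."""
    for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


def run_pipeline(audio_chunks: Iterable[bytes]) -> List[bytes]:
    """Chain ASR -> LLM -> TTS; each stage pulls lazily from the one before."""
    return list(fake_tts(fake_llm(fake_asr(audio_chunks))))


if __name__ == "__main__":
    for frame in run_pipeline([b"hello", b"world"]):
        print(frame)
```

In a real system each stage would wrap a model or an API call and the frames would be sent over the WebSocket; the generators only stand in for that streaming hand-off.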

The interesting bit

The project treats “single-file” as a feature, not a constraint: voice activity detection, interruption handling, and even proactive “subjective action judgement” are all crammed into one script you can actually trace through. It also layers in optional web search and RAG so the agent can inject live facts instead of just waiting for its turn.

Key highlights

One-file pipeline: the whole ASR → LLM → TTS agent, plus WebSocket receiver and sender, is contained in a single standalone script.
Full-duplex with interruptions: users can talk over the agent, and it will abort the current response to process new input while keeping conversation context.
Pluggable backends: defaults to LLM APIs like DeepSeek but can be pointed at local models such as DeepSeek-V2.5 or Qwen2.5.
Live data hooks: includes WebSearchHelper and RAG classes to pull online or retrieved knowledge into responses.
Prosody transfer: uses short reference audio clips with CosyVoice-300M to clone timbre and intonation.
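The interruption behaviour listed above (abort the current response, keep the context) can be sketched with `asyncio` task cancellation. This is an illustrative assumption about one way to do it, not the project's actual implementation: the `Conversation` class, its methods, and the echo reply are all hypothetical.

```python
import asyncio


class Conversation:
    """Hypothetical barge-in handler: cancel the reply in flight, keep history."""

    def __init__(self):
        self.history = []        # (speaker, text) pairs kept across interruptions
        self._reply_task = None  # the reply currently being spoken, if any

    async def _speak(self, reply_tokens):
        said = []
        try:
            for token in reply_tokens:
                await asyncio.sleep(0.05)  # stand-in for synthesising and sending audio
                said.append(token)
        except asyncio.CancelledError:
            # Record what was actually said before the user talked over it.
            self.history.append(("agent", " ".join(said) + " [interrupted]"))
            raise
        else:
            self.history.append(("agent", " ".join(said)))

    async def on_user_utterance(self, text):
        # Barge-in: abort the reply that is still playing, if there is one.
        if self._reply_task is not None and not self._reply_task.done():
            self._reply_task.cancel()
            try:
                await self._reply_task
            except asyncio.CancelledError:
                pass
        self.history.append(("user", text))
        reply_tokens = f"echo {text}".split()  # stand-in for an LLM reply
        self._reply_task = asyncio.create_task(self._speak(reply_tokens))

    async def wait_idle(self):
        if self._reply_task is not None:
            await self._reply_task


async def demo():
    convo = Conversation()
    await convo.on_user_utterance("one two three four five")
    await asyncio.sleep(0.12)              # let part of the reply play
    await convo.on_user_utterance("stop")  # user talks over the agent
    await convo.wait_idle()
    return convo.history


if __name__ == "__main__":
    for speaker, text in asyncio.run(demo()):
        print(speaker, text)
```

Recording the partial reply in the history is a design choice of this sketch: it lets the next LLM call see that the earlier answer was cut short.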

Caveats

The ASR default is paraformer-zh and the demos are Chinese-centric, so English-first users will likely need to swap in their own speech recognition models.
The output token limit is deliberately set low because of the authors’ limited compute resources.
Despite being one file, the script still depends on several heavy local models (three ASR, one TTS) and a reference audio directory.

Verdict

Grab this if you are a researcher or tinkerer who wants to understand or extend a speech-to-speech pipeline without navigating a sprawling codebase. Skip it if you need a production-ready, multilingual voice agent that works out of the box.

Frequently asked

What is opendilab/CleanS2S? CleanS2S packs a real-time, interruptible speech-to-speech pipeline into a single Python file so researchers can prototype voice agents without drowning in configuration.
Is CleanS2S open source? Yes, opendilab/CleanS2S is open source, released under the Apache-2.0 license.
What language is CleanS2S written in? opendilab/CleanS2S is primarily written in Python.
How popular is CleanS2S? opendilab/CleanS2S has 534 stars on GitHub.
Where can I find CleanS2S? opendilab/CleanS2S is on GitHub at https://github.com/opendilab/CleanS2S.