huggingface/speech-to-speech

The boring plumbing behind local voice agents

A modular speech-to-speech pipeline that exposes an OpenAI Realtime-compatible WebSocket API so you can run voice agents on local or open-source models instead of proprietary cloud services.

★ 6.3k stars | Python | Agents | Inference · Serving | Language Models

Velocity (7d): +21 ★/day · Trend: ↘ cooling

What it does

Hugging Face’s speech-to-speech is a cascaded voice-agent pipeline that wires together voice activity detection, speech-to-text, a language model, and text-to-speech. It exposes an OpenAI Realtime-compatible WebSocket endpoint at /v1/realtime, so existing clients can connect to local or open-weight models without code changes. You can run it entirely on-device (via transformers, mlx-lm, or Apple Silicon-optimized variants) or point it at self-hosted servers and provider APIs.
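The cascade described above can be sketched in a few lines of Python. This is a minimal illustration of the VAD -> STT -> LLM -> TTS flow with swappable stages, not the project's actual code: the class name `CascadedPipeline` and every stage function below are hypothetical placeholders standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Each stage is a plain callable, so any backend with the same shape can be swapped in.
VAD = Callable[[bytes], bool]   # True if the chunk contains speech
STT = Callable[[bytes], str]    # audio -> transcript
LLM = Callable[[str], str]      # transcript -> reply text
TTS = Callable[[str], bytes]    # reply text -> audio


@dataclass
class CascadedPipeline:
    """Hypothetical container for the four stages (not the project's class)."""
    vad: VAD
    stt: STT
    llm: LLM
    tts: TTS

    def handle_chunk(self, audio: bytes) -> Optional[bytes]:
        """Run one audio chunk through VAD -> STT -> LLM -> TTS."""
        if not self.vad(audio):
            return None  # silence: nothing to transcribe or answer
        text = self.stt(audio)
        reply = self.llm(text)
        return self.tts(reply)


# Placeholder stages standing in for real models (all hypothetical).
def energy_vad(audio: bytes) -> bool:
    return any(b != 0 for b in audio)


def fake_stt(audio: bytes) -> str:
    return f"<{len(audio)} bytes of speech>"


def echo_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(energy_vad, fake_stt, echo_llm, fake_tts)
```

Swapping a stage means passing a different callable, for example `CascadedPipeline(energy_vad, fake_stt, my_local_llm, fake_tts)`, which is the property a modular pipeline relies on.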

The interesting bit

The project treats protocol compatibility as a first-class feature. By cloning OpenAI’s Realtime API event schema (down to session updates, turn detection, and streaming audio deltas), it lets you swap a cloud dependency for a local stack without rewriting your client. The modularity is the point: each stage is hot-swappable, with support for Whisper, Parakeet TDT, Qwen3-TTS, Kokoro, and others.
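To make the event schema concrete, here is a rough sketch of the three kinds of message mentioned above, built as plain JSON strings. The event names follow OpenAI's published Realtime API; which subset and which revision of that schema this project accepts is an assumption to verify against its documentation, and the helper function names are hypothetical.

```python
import base64
import json

# OpenAI has used both names for streamed audio output across API revisions;
# which one a given server emits is an assumption to check.
AUDIO_DELTA_TYPES = {"response.audio.delta", "response.output_audio.delta"}


def session_update(instructions: str) -> str:
    """Client event: configure the session, including server-side turn detection."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "turn_detection": {"type": "server_vad"},
        },
    })


def append_audio(pcm: bytes) -> str:
    """Client event: stream a chunk of microphone audio, base64-encoded."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def decode_audio_delta(message: str) -> bytes:
    """Server event: extract synthesized audio bytes, or b"" for other events."""
    event = json.loads(message)
    if event.get("type") not in AUDIO_DELTA_TYPES:
        return b""
    return base64.b64decode(event["delta"])
```

A client would send these strings over the WebSocket at /v1/realtime and feed every incoming message through `decode_audio_delta`; the transport code is omitted here because it does not change between a cloud and a local server.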

Key highlights

OpenAI Realtime-compatible WebSocket server with streaming transcription, interruption handling, and tool calls
Fully local inference path on Apple Silicon via MLX-optimized STT, LLM, and TTS backends
Modular pipeline: Silero VAD for voice activity detection, Whisper or Parakeet for STT, and ChatTTS, Pocket TTS, or Qwen3-TTS for speech synthesis
TCP and WebSocket server/client modes for remote audio streaming
Single pyproject.toml with platform markers to handle macOS and non-macOS dependencies automatically
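As a loose illustration of the platform-marker idea in the last highlight, the sketch below mimics in plain Python what a `sys_platform == "darwin"` marker does at install time. The package groupings are assumptions chosen for illustration, not the project's actual dependency list, and `select_dependencies` is a hypothetical helper.

```python
import sys

# Illustrative groupings only (assumed, not taken from the project's pyproject.toml).
COMMON = ["numpy"]
MACOS_ONLY = ["mlx-lm"]   # would carry a marker like: sys_platform == "darwin"
NON_MACOS = ["torch"]     # would carry a marker like: sys_platform != "darwin"


def select_dependencies(platform: str = sys.platform) -> list[str]:
    """Return the packages an installer would keep for the given platform."""
    if platform == "darwin":
        return COMMON + MACOS_ONLY
    return COMMON + NON_MACOS
```

Because the installer evaluates the markers itself, one dependency file can serve both macOS and other platforms without separate requirements lists.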

Caveats

DeepFilterNet requires numpy<2 while Pocket TTS requires numpy>=2, so you cannot use DeepFilterNet audio enhancement and the Pocket TTS backend in the same environment
The default CLI configuration targets an OpenAI-compatible LLM API, so running fully offline requires explicitly selecting local backends like mlx-lm or transformers
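The numpy conflict in the first caveat can be checked mechanically. The sketch below takes a numpy version string and reports which of the two components could be installed alongside it; the backend labels `deepfilternet` and `pocket_tts` are hypothetical identifiers used only for this example.

```python
def numpy_major(version: str) -> int:
    """Extract the major version number from a string such as '1.26.4'."""
    return int(version.split(".")[0])


def usable_backends(numpy_version: str) -> set[str]:
    """Return which of the two conflicting components fit this numpy version."""
    major = numpy_major(numpy_version)
    backends = set()
    if major < 2:
        backends.add("deepfilternet")  # requires numpy<2
    if major >= 2:
        backends.add("pocket_tts")     # requires numpy>=2
    return backends
```

No version satisfies both constraints, so the result never contains both labels. To check a live environment, the installed version can be read with `importlib.metadata.version("numpy")` and passed in.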

Verdict

Worth a look if you are building voice agents and want an escape hatch from proprietary APIs, or need a local Apple Silicon stack. Skip it if you are looking for a single end-to-end model rather than a modular pipeline.

Frequently asked

What is huggingface/speech-to-speech? A modular speech-to-speech pipeline that exposes an OpenAI Realtime-compatible WebSocket API so you can run voice agents on local or open-source models instead of proprietary cloud services.
Is speech-to-speech open source? Yes, huggingface/speech-to-speech is open source, released under the Apache-2.0 license.
What language is speech-to-speech written in? huggingface/speech-to-speech is primarily written in Python.
How popular is speech-to-speech? huggingface/speech-to-speech has 6.3k stars on GitHub and is currently cooling off.
Where can I find speech-to-speech? huggingface/speech-to-speech is on GitHub at https://github.com/huggingface/speech-to-speech.