huggingface/speech-to-speech
A modular speech-to-speech pipeline for building local voice agents using open-source STT, LLM, and TTS models.

This repository implements a cascaded speech-to-speech pipeline that chains four stages: Voice Activity Detection (VAD), Speech-to-Text (Whisper), a language model, and Text-to-Speech synthesis. The pipeline is built around Hugging Face Transformers and supports local deployment on a range of devices, including Apple Silicon via MLX. It offers several usage modes (real-time, server/client, and WebSocket), and the backend for each stage of the pipeline is configurable.
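
To make the "cascaded" idea concrete, here is a minimal sketch of four stages connected by queues, each running in its own thread. It is an illustration only, not this repository's API: the helper names (`run_stage`, `run_pipeline`) and the stub stage functions (`vad`, `stt`, `llm`, `tts`) are hypothetical, and plain strings stand in for audio and model output.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def run_stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        result = fn(item)
        if result is not None:  # a stage may drop an item (e.g. VAD on silence)
            outbox.put(result)


# Hypothetical stand-ins for real models; strings play the role of audio.
def vad(chunk):
    return chunk if chunk.strip() else None  # drop "silent" chunks


def stt(speech):
    return speech.lower()  # pretend transcription


def llm(text):
    return f"you said: {text}"  # pretend language-model reply


def tts(reply):
    return f"<audio:{reply}>"  # pretend synthesized audio


def run_pipeline(chunks, stages):
    """Run chunks through the stages in order and collect the final outputs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(SENTINEL)

    outputs = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        outputs.append(item)
    for t in threads:
        t.join()
    return outputs


result = run_pipeline(["Hello", "   ", "How are you"], [vad, stt, llm, tts])
print(result)
```

Because each stage only reads from one queue and writes to the next, swapping a backend in this sketch means replacing a single function, which mirrors the configurable-backend design described above.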