kyutai-labs/delayed-streams-modeling
Kyutai Labs' speech-to-text and text-to-speech models using Delayed Streams Modeling for real-time streaming audio processing.

This repository provides Kyutai’s STT and TTS models built on the Delayed Streams Modeling (DSM) framework, a technique for streaming speech-to-text and text-to-speech. The STT models include a 1B-parameter English/French model with a 0.5-second delay and a 2.6B-parameter English-only model with a 2.5-second delay. Both STT models support real-time streaming inference, efficient batching (400 streams per H100 GPU), and word-level timestamps. The framework is described in a preprint at arxiv.org/abs/2509.08753.
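
To make the notion of a delay between streams concrete, here is a minimal toy sketch in Python. It is not the repository's implementation: the function names, the `<pad>` symbol, and the step-based delay are all hypothetical, and it only illustrates the general idea of shifting one time-aligned stream (here, text) by a fixed number of steps relative to another (here, audio frames), so that each text position is paired with audio that arrived earlier.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding symbol, not taken from the repository


def delay_stream(tokens: Sequence[str], delay_steps: int, pad: str = PAD) -> List[str]:
    """Shift a stream right by `delay_steps` positions, padding the start."""
    if delay_steps < 0:
        raise ValueError("delay_steps must be non-negative")
    return [pad] * delay_steps + list(tokens)


def align_streams(
    audio: Sequence[str],
    text: Sequence[str],
    delay_steps: int,
    pad: str = PAD,
) -> List[Tuple[str, str]]:
    """Pair each audio step with the text stream delayed by `delay_steps`.

    Both streams are padded at the end to a common length so that every
    step has exactly one entry per stream.
    """
    delayed_text = delay_stream(text, delay_steps, pad)
    length = max(len(audio), len(delayed_text))
    audio_padded = list(audio) + [pad] * (length - len(audio))
    text_padded = delayed_text + [pad] * (length - len(delayed_text))
    return list(zip(audio_padded, text_padded))


if __name__ == "__main__":
    # Toy data: three audio frames and two words, with a two-step delay.
    for step, pair in enumerate(align_streams(["a0", "a1", "a2"], ["hello", "world"], 2)):
        print(step, pair)
```

In this toy setup, a larger `delay_steps` means more audio has been seen before each text position is produced, which corresponds to the latency trade-off between the 0.5-second and 2.5-second models described above. How a delay in seconds maps to a number of steps depends on the model's frame rate, which is not stated here.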