kyutai-labs/moshi
A speech-text foundation model that enables real-time, full-duplex spoken dialogue using a streaming neural audio codec.

Velocity (7d): +15 ★/day
Trend: steady
Moshi is a foundation model that processes and generates both speech and text tokens for real-time conversational interaction. It uses Mimi, a streaming neural audio codec, to encode and decode audio streams. The repository provides multiple inference implementations: PyTorch for research, MLX for Apple Silicon devices, and Rust for production deployments. It also supports related models like Hibiki for speech translation.
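
To make the idea of a streaming codec concrete, here is a minimal sketch of the framing step such a codec relies on: incoming audio arrives in arbitrarily sized chunks and must be regrouped into fixed-size frames before each one is encoded. The sample rate (24 kHz) and frame rate (12.5 Hz) are taken from the Moshi project's published description of Mimi, not from this page, so treat them as assumptions. The helper name `iter_frames` is hypothetical and the code does not call the moshi library.

```python
from typing import Iterable, Iterator, List

# Assumptions (from the upstream project's description of Mimi, not this page):
SAMPLE_RATE_HZ = 24_000  # input audio sample rate
FRAME_RATE_HZ = 12.5     # codec frames produced per second
FRAME_SIZE = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples, i.e. 80 ms


def iter_frames(
    chunks: Iterable[List[float]], frame_size: int = FRAME_SIZE
) -> Iterator[List[float]]:
    """Regroup arbitrarily sized audio chunks into fixed-size frames.

    Hypothetical helper: a real pipeline would pass each yielded frame
    to the codec's encoder as soon as it is complete.
    """
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit every complete frame currently held in the buffer.
        while len(buffer) >= frame_size:
            yield buffer[:frame_size]
            buffer = buffer[frame_size:]
    # Samples left over at the end (less than one frame) are dropped here.


if __name__ == "__main__":
    # Five simulated microphone chunks of 1000 silent samples each.
    mic_chunks = [[0.0] * 1000 for _ in range(5)]
    frames = list(iter_frames(mic_chunks))
    print(len(frames), len(frames[0]))  # prints "2 1920"
```

Under these assumptions each frame covers 80 ms of audio, which is the granularity at which a real-time dialogue loop would encode input and decode output.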