kyutai-labs/moshi

A speech-text foundation model enabling real-time, full-duplex spoken dialogue with a streaming neural audio codec.

Velocity (7d): +15 ★ / day · Trend: steady

Moshi is a foundation model that processes and generates both speech and text tokens for real-time conversational interaction. It uses Mimi, a streaming neural audio codec, to encode and decode audio streams. The repository provides multiple inference implementations: PyTorch for research, MLX for Apple Silicon devices, and Rust for production deployments. It also supports related models like Hibiki for speech translation.
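
To make the idea of a streaming codec concrete, here is a minimal sketch of frame-by-frame encoding and decoding. It is not Mimi's API or algorithm: `frames`, `toy_encode`, `toy_decode`, `stream_roundtrip`, and the `FRAME_SIZE` value are hypothetical names chosen for illustration, and simple uniform quantization stands in for the neural codec. The part that carries over is the streaming pattern: audio arrives in arbitrarily sized chunks, is regrouped into fixed-size frames, and each frame is turned into tokens and back without waiting for the rest of the stream.

```python
from typing import Iterable, Iterator, List

FRAME_SIZE = 4  # hypothetical samples per frame; a real codec uses much larger frames


def frames(chunks: Iterable[List[float]], frame_size: int = FRAME_SIZE) -> Iterator[List[float]]:
    """Regroup arbitrarily sized input chunks into fixed-size frames as they arrive."""
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_size:
            yield buffer[:frame_size]
            buffer = buffer[frame_size:]
    # Trailing samples that do not fill a whole frame are dropped in this sketch.


def toy_encode(frame: List[float], levels: int = 16) -> List[int]:
    """Stand-in encoder: uniform quantization of samples in [-1, 1] to integer tokens."""
    return [min(levels - 1, max(0, int((s + 1.0) / 2.0 * levels))) for s in frame]


def toy_decode(tokens: List[int], levels: int = 16) -> List[float]:
    """Stand-in decoder: map each token back to the centre of its quantization bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]


def stream_roundtrip(chunks: Iterable[List[float]]) -> List[float]:
    """Encode and decode one frame at a time, emitting output as soon as a frame is ready."""
    out: List[float] = []
    for frame in frames(chunks):
        out.extend(toy_decode(toy_encode(frame)))
    return out
```

Because each frame is processed as soon as it fills, latency in this sketch is bounded by the frame length rather than by the length of the whole recording, which is the property a real-time dialogue system depends on.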