A voice bot that actually listens while it talks
Quillman wires Kyutai's Moshi model into a real-time speech-to-speech chat app, streaming audio both ways over a single websocket.

What it does
Quillman is a complete voice chat application built on Moshi, a speech-to-speech language model from Kyutai Labs. The backend runs a streaming encoder/decoder called Mimi to keep audio flowing continuously in both directions, plus a speech-text model that decides when to jump in and what to say. A React frontend and a FastAPI websocket server handle the plumbing, with Opus compression keeping latency low enough to feel like a real conversation.
The interesting bit
The bidirectional streaming is the trick. Most voice assistants record, stop, transcribe, think, then speak. Moshi listens and plans while it talks, so interruptions and overlaps work more like human dialogue. The README claims response times can be “nearly instantaneous” on a good internet connection, matching the cadence of human speech.
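To make the contrast with record-then-reply assistants concrete, here is a minimal sketch in Python using only asyncio. It is not Quillman's code: two in-memory queues stand in for the two directions of the websocket, strings stand in for audio chunks, and the names mic, receive_loop, send_loop, and run_demo are hypothetical. The point is only that the receiving and sending loops run concurrently instead of taking turns.

```python
import asyncio

END = None  # sentinel marking the end of the inbound stream (an assumption of this sketch)


async def mic(uplink: asyncio.Queue, chunks: list[str]) -> None:
    """Stand-in for the user's microphone: pushes chunks toward the bot."""
    for chunk in chunks:
        await uplink.put(chunk)
        await asyncio.sleep(0)  # yield so the other loops get a turn
    await uplink.put(END)


async def receive_loop(uplink: asyncio.Queue, log: list[tuple[str, str]]) -> None:
    """Bot side: keeps consuming inbound chunks while replies are going out."""
    while True:
        chunk = await uplink.get()
        if chunk is END:
            break
        log.append(("heard", chunk))


async def send_loop(
    downlink: asyncio.Queue, replies: list[str], log: list[tuple[str, str]]
) -> None:
    """Bot side: emits reply chunks without waiting for the user to finish."""
    for chunk in replies:
        await downlink.put(chunk)
        log.append(("sent", chunk))
        await asyncio.sleep(0)  # yield so listening continues between chunks


async def run_demo() -> list[tuple[str, str]]:
    """Run all three loops at once and return the order in which events occurred."""
    uplink: asyncio.Queue = asyncio.Queue()
    downlink: asyncio.Queue = asyncio.Queue()
    log: list[tuple[str, str]] = []
    await asyncio.gather(
        mic(uplink, ["u0", "u1", "u2"]),
        receive_loop(uplink, log),
        send_loop(downlink, ["b0", "b1", "b2"], log),
    )
    return log


if __name__ == "__main__":
    for event in asyncio.run(run_demo()):
        print(event)
```

In the returned log, "sent" entries appear before the last "heard" entry: the bot is already talking while the user's audio is still arriving. A turn-based loop would produce every "heard" entry first and every "sent" entry afterward.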
Key highlights
- Built on Moshi, Mimi, and a speech-text foundation model from Kyutai Labs
- Bidirectional websocket streaming with Opus audio compression
- Deployed via Modal, which scales to zero, so you pay nothing while the app is idle
- Includes a terminal client for testing without the frontend
- Meant as a starter template for other language model apps
Caveats
- Requires a Modal account and token; not a simple docker run
- Code is “for illustration only”; commercial use requires checking model licenses
- Frontend changes may need browser cache clearing during development
Verdict
Worth a look if you’re building real-time voice AI and want a working reference architecture on serverless infrastructure. Skip it if you need a drop-in, self-hosted solution without cloud dependencies.