Your laptop is now a voice AI that actually sees you
A weekend project proves you don't need OpenAI's servers—or an RTX 5090—to run real-time multimodal voice conversations locally.

What it does
Parlor is a browser-based voice assistant that runs entirely on your machine. You talk, point your camera at things, and it talks back. The heavy lifting happens locally via a FastAPI server: Google’s Gemma 4 E2B model handles speech and vision understanding through LiteRT-LM, while Kokoro generates text-to-speech responses. A simple WebSocket pipes audio and JPEG frames from your browser to the server and streams synthesized speech back.
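To make the browser-to-server hop concrete, here is a minimal sketch of how one user turn (an audio clip plus an optional JPEG frame) could be packed into a single WebSocket text message. The write-up does not describe Parlor's actual wire format, so the JSON layout, the field names `audio` and `image`, and the helpers `encode_turn` and `decode_turn` are all hypothetical.

```python
import base64
import json
from typing import Optional, Tuple


def encode_turn(audio: bytes, jpeg: Optional[bytes] = None) -> str:
    """Pack one user turn as JSON text (hypothetical layout, not Parlor's)."""
    # Binary payloads are base64-encoded so they survive inside a JSON string.
    message = {"audio": base64.b64encode(audio).decode("ascii")}
    if jpeg is not None:
        message["image"] = base64.b64encode(jpeg).decode("ascii")
    return json.dumps(message)


def decode_turn(text: str) -> Tuple[bytes, Optional[bytes]]:
    """Unpack a message produced by encode_turn back into raw bytes."""
    message = json.loads(text)
    audio = base64.b64decode(message["audio"])
    image = message.get("image")
    jpeg = base64.b64decode(image) if image is not None else None
    return audio, jpeg
```

Base64 inflates payloads by about a third; a real implementation might prefer binary WebSocket frames, but a text envelope keeps the sketch easy to inspect.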
The interesting bit
The author built this to solve a real sustainability problem: he was self-hosting a free English-learning voice AI for hundreds of users and needed to kill the server bill. Six months ago that required an RTX 5090. Now it runs on an M3 Pro laptop using roughly 3 GB of RAM. The “barge-in” feature is a nice touch: you can interrupt the AI mid-sentence, which is harder to get right than it sounds when everything is streaming in real time.
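One common way to structure barge-in is to run playback as a cancellable task and stop it as soon as voice activity is detected. The sketch below illustrates that idea with Python's asyncio; it is not taken from Parlor's code. The names `speak`, `fake_vad`, and `run_turn` are hypothetical, and timers stand in for real audio playback and a real VAD.

```python
import asyncio
from typing import List, Tuple


async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for streamed TTS playback: 'plays' one chunk every 50 ms."""
    for chunk in chunks:
        await asyncio.sleep(0.05)
        played.append(chunk)


async def fake_vad(speech_detected: asyncio.Event, delay: float) -> None:
    """Stand-in for a voice activity detector: fires once after `delay` seconds."""
    await asyncio.sleep(delay)
    speech_detected.set()


async def run_turn(chunks: List[str], user_speaks_after: float) -> Tuple[List[str], bool]:
    """Play a reply, cutting it off if the user starts talking first."""
    played: List[str] = []
    speech_detected = asyncio.Event()
    speaking = asyncio.create_task(speak(chunks, played))
    listening = asyncio.create_task(fake_vad(speech_detected, user_speaks_after))
    barge_in = asyncio.create_task(speech_detected.wait())

    # Wake up when either playback finishes or the user starts speaking.
    await asyncio.wait({speaking, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = not speaking.done()

    # Cancel whatever is still running; finished tasks ignore cancel().
    for task in (speaking, listening, barge_in):
        task.cancel()
    await asyncio.gather(speaking, listening, barge_in, return_exceptions=True)
    return played, interrupted
```

For example, `asyncio.run(run_turn(list("abcdefghij"), 0.1))` stops playback partway through, while a large `user_speaks_after` lets the whole reply play. The hard part in a real system is what this sketch skips: flushing audio already buffered in the browser and discarding tokens the model has generated but not yet spoken.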
Key highlights
- End-to-end latency of ~2.5–3.0 seconds on Apple M3 Pro (1.8–2.2s for speech/vision understanding, 0.3s for ~25 tokens, 0.3–0.7s for TTS)
- Decode speed: ~83 tokens/sec on GPU via LiteRT-LM
- Sentence-level TTS streaming means audio starts before the full response is finished
- Browser-based VAD (Silero) for hands-free operation, no push-to-talk button
- Platform-aware TTS: MLX on Mac, ONNX on Linux
- ~2.6 GB model download on first run, auto-fetched from HuggingFace
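The sentence-level streaming highlight can be illustrated with a small generator that buffers model tokens and emits each sentence as soon as it looks complete, so TTS can start on the first sentence while later ones are still being generated. This is a rough sketch under simple assumptions, not Parlor's implementation: `sentences_from_tokens` is a hypothetical name, and the punctuation-based split is deliberately naive.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary is whitespace that follows '.', '!' or '?'.
# Naive on purpose: abbreviations such as "Dr. Smith" would split incorrectly.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as they become available in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains once the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be handed to the TTS engine immediately. Note that a sentence is only released once the whitespace after its closing punctuation arrives, so the final sentence waits for the end of the stream.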
Caveats
- Explicitly marked “research preview” with expected rough edges and bugs
- macOS requires Apple Silicon; Linux needs a supported GPU
- Python 3.12+ only, and the frontend is a single index.html, so don’t expect a polished UI
- The author notes you “can’t do agentic coding with this”; it’s narrowly scoped to conversation
Verdict
Worth a spin if you’re building local AI assistants, teaching language learners, or just want to see how far small models have come. Skip it if you need reliability, broad hardware support, or anything beyond a conversational demo—the author is upfront that this is an early experiment, not a product.