A voice assistant that doesn't phone home
Rhasspy is an offline-first voice toolkit for Home Assistant users who'd rather not ship their living-room conversations to a cloud server.

What it does
Rhasspy turns voice commands into JSON events that trigger automations in Home Assistant, Hass.io, or Node-RED. You define commands in a profile using a template syntax, and it handles wake-word detection, speech-to-text, intent recognition, and text-to-speech — all without an internet connection. It runs in Docker and exposes a web UI on port 12101.
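The template idea can be sketched in Python. This is a toy, not Rhasspy's parser, and it assumes a tiny subset of the syntax modeled loosely on Rhasspy's sentence templates. It supports only flat (a | b) alternatives, each optionally followed by a {tag} that turns the chosen word into a slot. The real template language handles more, including nesting. The helper names, the intent name ChangeLightState, and the event's field names are hypothetical and are not Rhasspy's exact JSON schema.

```python
from __future__ import annotations

import itertools
import json
import re

# A flat group such as "(on | off)", optionally followed by a "{tag}".
# Toy subset only: no nesting or other template constructs.
GROUP = re.compile(r"\(([^()]*)\)(?:\{(\w+)\})?")


def expand_template(template: str) -> list[tuple[str, dict[str, str]]]:
    """Expand a toy template into every (sentence, slots) pair it can produce."""
    # Each piece is a list of (text, tag) choices; plain text has exactly one.
    pieces: list[list[tuple[str, str | None]]] = []
    pos = 0
    for match in GROUP.finditer(template):
        if match.start() > pos:
            pieces.append([(template[pos:match.start()], None)])
        alternatives = [alt.strip() for alt in match.group(1).split("|")]
        pieces.append([(alt, match.group(2)) for alt in alternatives])
        pos = match.end()
    if pos < len(template):
        pieces.append([(template[pos:], None)])

    expansions = []
    for combo in itertools.product(*pieces):
        # Join the chosen pieces and collapse any doubled whitespace.
        sentence = " ".join("".join(text for text, _ in combo).split())
        # Tagged choices become slot values, e.g. {"state": "off"}.
        slots = {tag: text for text, tag in combo if tag}
        expansions.append((sentence, slots))
    return expansions


def recognize(transcript: str, intent: str, template: str) -> dict | None:
    """Hypothetical exact-match recognizer: a JSON-ready event dict, or None."""
    normalized = " ".join(transcript.lower().split())
    for sentence, slots in expand_template(template):
        if sentence == normalized:
            return {"text": sentence, "intent": {"name": intent}, "slots": slots}
    return None


template = "turn (on | off){state} the (living room lamp | kitchen light){name}"
event = recognize("Turn off the kitchen light", "ChangeLightState", template)
# prints {"text": "turn off the kitchen light", "intent": {"name": "ChangeLightState"},
#         "slots": {"state": "off", "name": "kitchen light"}}
print(json.dumps(event))
```

Expanding every sentence up front grows multiplicatively with the alternatives. Here, two groups of two choices already yield four sentences, so the expansion is cheap for small command sets but would need a smarter representation for large ones.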
The interesting bit
The project is essentially a curated integration layer: it bundles existing open-source engines (Pocketsphinx, Kaldi, Porcupine, Fsticuffs, eSpeak, etc.) and maps their uneven language support into a single coherent system. The matrix of which engine supports which language is refreshingly honest — Vietnamese speech-to-text only works with Kaldi, for instance, and several wake-word engines need custom training for non-English use.
Key highlights
- Supports 14 languages, though coverage varies wildly by component
- Wake-word options: Pocketsphinx (broadest), Porcupine (English-only), Snowboy/Precise (require training for most languages)
- Intent recognition is the most uniformly supported layer; Fsticuffs, fuzzywuzzy, and Adapt work offline for all languages
- Text-to-speech defaults to eSpeak for universal offline coverage, with MaryTTS, PicoTTS, and others as alternatives
- Explicitly targets “advanced users” comfortable writing their own Home Assistant automations
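A short sketch shows why a fuzzy recognizer is useful when speech-to-text slips a letter. It assumes a fuzzywuzzy-style recognizer scores the transcript against every example sentence and keeps the best match. Python's standard difflib.SequenceMatcher stands in for fuzzywuzzy here, so its scores will not match fuzzywuzzy's exactly. The training sentences, the intent names, and the 0.8 cut-off are all hypothetical.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Hypothetical example sentences and the intents they map to.
TRAINING = {
    "turn on the living room lamp": "LampOn",
    "turn off the living room lamp": "LampOff",
    "what time is it": "GetTime",
}


def fuzzy_intent(transcript: str, threshold: float = 0.8) -> tuple[str, float] | None:
    """Score the transcript against every example sentence and keep the best."""
    text = " ".join(transcript.lower().split())
    # SequenceMatcher.ratio() is a similarity score in [0, 1] (a difflib
    # stand-in for fuzzywuzzy's scoring, not the same numbers).
    score, sentence = max(
        (SequenceMatcher(None, text, candidate).ratio(), candidate)
        for candidate in TRAINING
    )
    # Below the (hypothetical) threshold, report no intent rather than a bad guess.
    if score < threshold:
        return None
    return TRAINING[sentence], score


print(fuzzy_intent("Turn on the living room lamb"))  # one-letter slip, still LampOn
print(fuzzy_intent("play some jazz"))  # None: nothing is close enough
```

The threshold is the design lever. Set it too low and unrelated requests trigger automations. Set it too high and every small transcription error is rejected, which is the behavior of an exact grammar match.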
Caveats
- This repository contains version 2.4; active development has moved to github.com/rhasspy/rhasspy (version 2.5)
- The author candidly recommends Mycroft if you want something easier and don’t mind cloud processing
- Language support is a patchwork: Mandarin and Hindi have wake-word and STT coverage, but no TTS via MaryTTS; Swedish lacks wake-word support entirely
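The patchwork can be pictured as a list of known gaps that a profile is checked against. This sketch encodes only the gaps stated in this review: Porcupine is English-only, Swedish has no wake-word engine, Vietnamese speech-to-text needs Kaldi, and MaryTTS lacks Mandarin and Hindi. The language codes, the component names, and the known_gap helper are hypothetical. A None result means "not flagged here", not "supported".

```python
from __future__ import annotations

# Only the gaps mentioned in this review; absence does NOT imply support.
ONLY_LANGUAGES = {("wake", "porcupine"): {"en"}}      # engine limited to these
EXCLUDED_LANGUAGES = {("tts", "marytts"): {"zh", "hi"}}  # engine lacks these
ONLY_ENGINE = {("stt", "vi"): "kaldi"}                # language needs this engine
NO_ENGINE = {("wake", "sv")}                          # no engine at all


def known_gap(component: str, engine: str, language: str) -> str | None:
    """Return a reason if the review says this combination won't work."""
    if (component, language) in NO_ENGINE:
        return f"no {component} engine supports {language!r}"
    required = ONLY_ENGINE.get((component, language))
    if required is not None and engine != required:
        return f"{component} for {language!r} only works with {required}"
    allowed = ONLY_LANGUAGES.get((component, engine))
    if allowed is not None and language not in allowed:
        return f"{engine} supports only {sorted(allowed)}"
    if language in EXCLUDED_LANGUAGES.get((component, engine), set()):
        return f"{engine} has no {language!r} support"
    return None


# Hypothetical profile: one engine chosen per component.
profile = {"wake": "porcupine", "stt": "pocketsphinx", "intent": "fsticuffs", "tts": "espeak"}
for component, engine in profile.items():
    print(component, engine, known_gap(component, engine, "vi"))
```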
Verdict
Worth a look if you run Home Assistant, value privacy, and don’t mind trading polish for control. Skip it if you want turnkey setup or natural-sounding voice synthesis — eSpeak’s robot diction is functional, not friendly.