A $5 chip that talks to DeepSeek and controls your lights
XiaoZhi crams a voice-activated LLM chatbot onto ESP32 hardware, using the Model Context Protocol (MCP) to bridge AI models with physical devices.

What it does
XiaoZhi is voice-interaction firmware for ESP32 microcontrollers. The chip runs wake-word detection locally, then streams audio to a server that handles speech recognition, LLM inference via Qwen or DeepSeek, and text-to-speech, so the device itself can be a board that costs less than a coffee. The project also supports 4G connectivity, speaker recognition, and displays that can show emoji.
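The split between on-device and server-side work can be pictured as a small state machine. This is a hypothetical sketch: the state names, event names, and transitions are assumptions chosen for illustration, not taken from the XiaoZhi firmware.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical device states; not XiaoZhi's actual state names."""
    IDLE = auto()       # only the local wake-word engine is running
    LISTENING = auto()  # mic audio is streamed to the server for speech recognition
    SPEAKING = auto()   # TTS audio returned by the server is played back


# (current state, event) -> next state. Event names are illustrative assumptions.
TRANSITIONS = {
    (State.IDLE, "wake_word"): State.LISTENING,
    (State.LISTENING, "tts_start"): State.SPEAKING,  # server began replying
    (State.SPEAKING, "tts_stop"): State.LISTENING,   # reply finished; keep the conversation open
    (State.LISTENING, "timeout"): State.IDLE,        # silence; fall back to wake-word mode
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no matching transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events):
    """Replay a sequence of events starting from IDLE and return every state visited."""
    state = State.IDLE
    visited = [state]
    for event in events:
        state = step(state, event)
        visited.append(state)
    return visited


print([s.name for s in run(["wake_word", "tts_start", "tts_stop", "timeout"])])
# ['IDLE', 'LISTENING', 'SPEAKING', 'LISTENING', 'IDLE']
```

The point of the sketch is that only the IDLE state is purely local; every other state depends on a live connection to the server.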
The interesting bit
The project treats MCP (Model Context Protocol) as the universal glue: device-side MCP exposes hardware controls like LEDs and servos, while cloud-side MCP extends the LLM to smart home gear, desktop automation, even email. In practice, the assistant that tells you the weather can also dim your lights or send an email, with the model deciding which tool to call.
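MCP messages are JSON-RPC 2.0, and the spec defines `tools/list` (advertise tools with a JSON Schema for their arguments) and `tools/call` (invoke one by name). The sketch below shows a minimal device-side dispatcher for those two methods. The tool name `self.light.set_brightness`, the `device` dict standing in for real hardware, and the handler are hypothetical; how XiaoZhi actually transports these messages between server and chip is not shown.

```python
import json

# Hypothetical stand-in for real hardware state.
device = {"brightness": 100}


def set_brightness(args):
    """Hypothetical tool handler: validate the argument, then 'drive' the LED."""
    level = int(args["level"])
    if not 0 <= level <= 100:
        raise ValueError("level must be between 0 and 100")
    device["brightness"] = level
    return f"brightness set to {level}"


# Tool registry: what the model sees via tools/list, plus the local handler.
TOOLS = {
    "self.light.set_brightness": {
        "description": "Set LED brightness as a percentage (0-100).",
        "inputSchema": {
            "type": "object",
            "properties": {"level": {"type": "integer", "minimum": 0, "maximum": 100}},
            "required": ["level"],
        },
        "handler": set_brightness,
    },
}


def _reply(req_id, result=None, error=None):
    """Build a JSON-RPC 2.0 response carrying either a result or an error."""
    msg = {"jsonrpc": "2.0", "id": req_id}
    if error is not None:
        msg["error"] = error
    else:
        msg["result"] = result
    return json.dumps(msg)


def handle(raw: str) -> str:
    """Dispatch one JSON-RPC request for the MCP tools/list and tools/call methods."""
    req = json.loads(raw)
    req_id, method = req.get("id"), req.get("method")
    if method == "tools/list":
        tools = [{"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
                 for name, t in TOOLS.items()]
        return _reply(req_id, {"tools": tools})
    if method == "tools/call":
        params = req.get("params", {})
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            # Unknown tool is a protocol-level error (JSON-RPC "invalid params").
            return _reply(req_id, error={"code": -32602, "message": "Unknown tool"})
        try:
            text = tool["handler"](params.get("arguments", {}))
            return _reply(req_id, {"content": [{"type": "text", "text": text}], "isError": False})
        except (KeyError, TypeError, ValueError) as exc:
            # Tool failures go back in the result so the model can react to them.
            return _reply(req_id, {"content": [{"type": "text", "text": str(exc)}], "isError": True})
    return _reply(req_id, error={"code": -32601, "message": "Method not found"})


request = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
           "params": {"name": "self.light.set_brightness", "arguments": {"level": 30}}}
print(handle(json.dumps(request)))
# {"jsonrpc": "2.0", "id": 1, "result": {"content": [{"type": "text", "text": "brightness set to 30"}], "isError": false}}
```

The model never touches the hardware directly: it reads the schema from `tools/list`, picks a tool, and emits a `tools/call` request that the device validates before acting.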
Key highlights
- Supports 70+ hardware variants from breadboards to commercial dev boards (M5Stack, LILYGO, Waveshare, etc.)
- Offline wake-word detection via ESP-SR; audio compressed with the Opus codec
- Dual protocol support: WebSocket or MQTT+UDP hybrid
- Free tier available through the official xiaozhi.me server, which uses the Qwen real-time model
- Custom wake words, fonts, and emojis via web-based asset generator
- Active ecosystem: Python, Java, and Go server implementations; Android and Linux clients
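Compressing audio matters because the chip streams speech over Wi-Fi or 4G. The arithmetic can be sketched as below, assuming 16 kHz mono 16-bit PCM, 60 ms frames, and a 24 kbps average Opus bitrate; all three defaults are illustrative assumptions, not XiaoZhi's confirmed settings. The frame durations come from RFC 6716, and the sample rates are the input rates libopus accepts.

```python
# Opus frame durations (RFC 6716) and input sample rates accepted by libopus.
OPUS_FRAME_MS = (2.5, 5, 10, 20, 40, 60)
OPUS_SAMPLE_RATES = (8000, 12000, 16000, 24000, 48000)


def frame_budget(sample_rate=16000, frame_ms=60, opus_kbps=24, bytes_per_sample=2):
    """Compare one mono PCM frame with an Opus frame at a target average bitrate.

    Defaults are illustrative assumptions, not the project's configured values.
    """
    if sample_rate not in OPUS_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if frame_ms not in OPUS_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    samples = round(sample_rate * frame_ms / 1000)        # PCM samples per frame
    pcm_bytes = samples * bytes_per_sample                 # raw frame size in bytes
    opus_bytes = opus_kbps * 1000 * frame_ms / 1000 / 8    # average encoded size in bytes
    return {"samples": samples, "pcm_bytes": pcm_bytes,
            "opus_bytes": opus_bytes, "ratio": round(pcm_bytes / opus_bytes, 1)}


print(frame_budget())
# {'samples': 960, 'pcm_bytes': 1920, 'opus_bytes': 180.0, 'ratio': 10.7}
```

Under these assumptions each 60 ms frame shrinks from about 1.9 KB to about 180 bytes, roughly a tenfold saving on the uplink.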
Caveats
- Devices on v1 firmware cannot be upgraded to v2 over the air; a manual reflash is required
- v1 branch maintained only until February 2026
- Documentation and community primarily Chinese-language; English coverage exists but is thinner
Verdict
Hardware hackers and IoT builders who want LLM voice control without a Raspberry Pi budget should grab this. Pure software developers looking for a cloud API should look elsewhere: the value is in the tight integration between firmware and hardware.