espressif/esp-sr

Alexa on a $2 chip: Espressif's embedded speech stack

A full voice pipeline—wake word, command recognition, noise suppression, even TTS—packaged as drop-in components for ESP32 variants.

1.4k stars · C · Image · Video · Audio
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

ESP-SR is Espressif’s official speech recognition framework for its microcontrollers. It bundles an audio front-end (echo cancellation, noise suppression, voice activity detection), wake word detection (WakeNet), offline command recognition (MultiNet), and speech synthesis into component-sized chunks you drop into an ESP-IDF project.

The interesting bit

The silicon targeting is unusually granular. There’s WakeNet9s for the bare-bones ESP32-C3 (no PSRAM, no SIMD), WakeNet9 for the S3, and WakeNet9l for fast-spoken wake words at a 1.3× resource premium. The two-mic audio front-end is even Amazon-qualified for Alexa Built-in, on a chip that costs less than a coffee.
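The variant split above can be read as a small decision rule. The sketch below encodes it under stated assumptions: the type and function names are hypothetical (not part of ESP-SR), the rule treats "no PSRAM or no SIMD" as the trigger for WakeNet9s, and it assumes WakeNet9l is only an option on the more capable chips. In an actual project the model is chosen through the build configuration rather than at runtime.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not ESP-SR identifiers. */
typedef enum { WN9S, WN9, WN9L, WN_VARIANT_COUNT } wakenet_variant_t;

typedef struct {
    bool has_psram;  /* external PSRAM available */
    bool has_simd;   /* SIMD/vector instructions available */
} chip_caps_t;

/* Pick a WakeNet variant from chip capabilities.
 * Assumption: constrained chips always get the small model, even when
 * the wake word is fast-spoken. */
wakenet_variant_t pick_wakenet_variant(chip_caps_t caps, bool fast_spoken)
{
    if (!caps.has_psram || !caps.has_simd) {
        return WN9S;                     /* e.g. ESP32-C3 class */
    }
    return fast_spoken ? WN9L : WN9;     /* e.g. ESP32-S3 class */
}

/* Human-readable name for logging. */
const char *wakenet_variant_name(wakenet_variant_t v)
{
    static const char *const names[WN_VARIANT_COUNT] = {
        "WakeNet9s", "WakeNet9", "WakeNet9l"
    };
    assert((unsigned)v < (unsigned)WN_VARIANT_COUNT);
    return names[v];
}
```

For example, a chip described as `{ .has_psram = false, .has_simd = false }` maps to WakeNet9s regardless of the `fast_spoken` flag, while a PSRAM-plus-SIMD part with `fast_spoken` set maps to WakeNet9l and pays the 1.3× resource premium.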

Key highlights

  • Supports 300+ offline speech commands in Chinese and English via MultiNet, with per-chip model variants (mn2 through mn7)
  • Wake word catalog is extensive and slightly eccentric: “Hi,ESP”, “Jarvis”, “Computer”, “Hey,Willow”, plus dozens of Chinese options
  • TTS Pipeline V3 now trains wake words for Chinese, English, Japanese, and French; Korean, Spanish, Arabic, and others are queued
  • AFE handles full-duplex echo cancellation, blind source separation, and deep noise suppression
  • Preliminary ESP32-S31 support added; migration guide exists for the V1→V2 breaking change

Caveats

  • Most wake words are trademarked examples (Alexa, etc.) with prominent legal disclaimers; commercial use requires your own rights or authorization
  • README is thorough on models but vague on actual RAM/flash budgets and latency numbers
  • A dedicated V2.0 migration guide suggests the V1→V2 upgrade carries enough breaking changes that existing projects will need porting work

Verdict

Worth a look if you’re building voice-controlled hardware on ESP32 and want to avoid cloud dependencies. Skip if you need continuous large-vocabulary recognition or non-Espressif silicon.
