Shrink-wrapping speech recognition for chips that still use kilobytes
ARM's reference implementation squeezes 'Hey Siri'-class keyword spotting onto Cortex-M microcontrollers with no floating-point unit to spare.

What it does
This is the companion repo to ARM’s 2017 “Hello Edge” paper: training scripts, frozen TensorFlow graphs, and deployment code for running wake-word detection on microcontrollers with clock speeds in the tens to hundreds of megahertz and hundreds of kilobytes of RAM. You train a DNN, CNN, LSTM, GRU, CRNN, or DS-CNN in Python, quantize it, then run inference on a Cortex-M board using CMSIS-NN.
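To make the "quantize it" step concrete, here is a minimal sketch of 8-bit fixed-point quantization with a power-of-two scale (the Qm.n style). It is an illustration under assumptions, not the repo's procedure: the function names are hypothetical, the sample weights are made up, and the repo's own quantization guide remains the authority on how its models are actually converted.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick fractional bits n for a signed fixed-point format so the
    largest magnitude in `values` fits (up to saturation at the top code)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    # Integer bits needed for the magnitude; one bit is reserved for the sign.
    int_bits = math.ceil(math.log2(max_abs))
    return total_bits - 1 - int_bits


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round to nearest, saturate to the signed range."""
    scale = 2 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize(q_values, frac_bits):
    """Map integer codes back to real values for error inspection."""
    scale = 2 ** frac_bits
    return [q / scale for q in q_values]


# Made-up weights, for illustration only.
weights = [0.9, -0.5, 0.25, -0.0078125]
n = choose_frac_bits(weights)
q = quantize(weights, n)
restored = dequantize(q, n)
print(n, q, restored)
```

Comparing `restored` against `weights` gives a quick feel for how much precision a layer loses at 8 bits. A power-of-two scale is chosen here because rescaling then reduces to bit shifts, which are cheap on a microcontroller.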
The interesting bit
The repo doesn’t just hand you models: it hands you a menu of memory-accuracy tradeoffs. The included table (and pretrained .pb files) lets you pick whether you want a 79 KB DNN that runs in 50K ops or a heftier DS-CNN that buys you a few more points of accuracy. The quantization guide and Cortex-M example code close the loop from training script to silicon.
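For a feel of where numbers like these come from, here is a rough sketch that estimates weight memory and operation count for a fully connected network from its layer widths. It rests on assumptions: the function name is hypothetical, the input size and class count in the usage lines are illustrative rather than taken from the repo, weights are assumed to be 8-bit, each multiply-accumulate is counted as two operations, and activation buffers are ignored.

```python
def dense_cost(input_dim, hidden_layers, num_classes, bytes_per_weight=1):
    """Return (parameters, weight_bytes, ops) for a fully connected network.

    Ops are counted as 2 per multiply-accumulate; biases add parameters
    but are not counted as extra ops in this rough estimate.
    """
    dims = [input_dim, *hidden_layers, num_classes]
    params = 0
    macs = 0
    for fan_in, fan_out in zip(dims, dims[1:]):
        params += fan_in * fan_out + fan_out  # weights plus biases
        macs += fan_in * fan_out
    return params, params * bytes_per_weight, 2 * macs


# Illustrative sizes only: 250 input features, three 128-wide layers, 12 classes.
params, weight_bytes, ops = dense_cost(250, [128, 128, 128], 12)
print(f"{params} parameters, {weight_bytes / 1000:.1f} KB of weights, {ops} ops")
```

Running the same function over a few candidate layer widths is a cheap way to see whether a model has any chance of fitting before you train it.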
Key highlights
- Seven architectures pretrained and ready to freeze: DNN, CNN, Basic LSTM, LSTM, GRU, CRNN, DS-CNN
- model_size_info CLI argument lets you specify layer dimensions without touching code
- Full reproduction commands for every paper result in train_commands.txt
- Deployment folder includes quantization guide and bare-metal Cortex-M inference example
- Direct lineage: adapted from TensorFlow’s speech_commands example, but self-contained
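The model_size_info idea is easy to picture with a small argparse sketch: a flag that takes a variable-length list of integers, so the architecture can be changed from the command line. This is not the repo's training script. The parser, the model_architecture flag, the default values, and the describe_dnn helper are assumptions for illustration, and the sketch assumes that for a DNN each integer is one layer width; how the repo interprets the list for other architectures is not covered here.

```python
import argparse


def build_parser():
    """Hypothetical training CLI with a variable-length list of layer sizes."""
    parser = argparse.ArgumentParser(description="Hypothetical KWS training CLI")
    parser.add_argument("--model_architecture", default="dnn",
                        help="Which model family to build (assumed flag)")
    parser.add_argument("--model_size_info", type=int, nargs="+",
                        default=[128, 128, 128],
                        help="Layer dimensions, one integer per value")
    return parser


def describe_dnn(size_info):
    """Render a list of layer widths as a readable layer chain."""
    return " -> ".join(f"fc({width})" for width in size_info)


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(
    ["--model_architecture", "dnn", "--model_size_info", "144", "144", "144"]
)
print(args.model_architecture, describe_dnn(args.model_size_info))
```

Because the list is parsed with nargs="+", adding or removing a layer is a matter of typing one more or one fewer number, with no edit to the model code.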
Caveats
- The README is sparse on the deployment side: “example code… is provided here” with no detail on which boards, clock speeds, or measured latencies
- Pretrained models and paper date to 2017 and use the TensorFlow 1.x-era .pb format, so expect friction with modern TF/PyTorch workflows
Verdict
Worth a look if you’re actually shipping voice wake-words on constrained silicon and need a proven baseline to beat. Skip it if you’re doing keyword spotting on anything with an application processor: this is a museum piece for a specific, very small room.