MNIST, but you can hear it coming
A tidy, versioned dataset of 3,000 spoken digits for when your model needs to learn what "seven" sounds like at 8kHz.

What it does
FSDD (the Free Spoken Digit Dataset) collects WAV recordings of English digits 0-9 from six speakers, trimmed to near-minimal silence and given predictable filenames like 7_jackson_32.wav (digit, speaker, recording index). It ships with a prescribed train/test split (the first 10% of each speaker’s recordings of each digit, indices 0-4, form the test set) and a small Python API for loading data and generating spectrograms.
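The naming scheme and split rule are simple enough to sketch in a few lines. The following is a minimal illustration, not the project's own loader: `Recording`, `parse_name`, and `split_of` are hypothetical names, and the cutoff of 5 assumes that "first 10%" means recording indices 0-4 out of 50 go to the test set.

```python
from pathlib import Path
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int
    speaker: str
    index: int


def parse_name(filename: str) -> Recording:
    """Split '7_jackson_32.wav' into digit=7, speaker='jackson', index=32."""
    stem = Path(filename).stem
    # Take the digit from the left and the index from the right, so a
    # speaker name containing an underscore would still parse.
    digit, rest = stem.split("_", 1)
    speaker, index = rest.rsplit("_", 1)
    return Recording(int(digit), speaker, int(index))


def split_of(rec: Recording, test_cutoff: int = 5) -> str:
    """Assumption: indices below the cutoff (the first 10% of 50) are test."""
    return "test" if rec.index < test_cutoff else "train"


rec = parse_name("7_jackson_32.wav")
print(rec, split_of(rec))
```

Because the split depends only on the filename, it can be reproduced without any separate manifest file.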
The interesting bit
The dataset is deliberately boring in the right ways: fixed 8kHz mono, consistent naming, Zenodo DOI versioning for reproducibility, and a metadata.py tracking speaker gender and accent. That predictability is the point — it’s audio MNIST in spirit, a quick sanity-check substrate for speech pipelines.
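If you add your own recordings, the fixed format is easy to check before they enter a pipeline. Here is a rough sketch using only Python's standard `wave` module, which reads uncompressed PCM files. The helper names are hypothetical, and the demo file is a synthetic sine tone standing in for a real recording.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE = 8000   # Hz, per the dataset description
EXPECTED_CHANNELS = 1  # mono


def check_format(path):
    """Return (ok, sample_rate, channels, duration_seconds) for a PCM wav."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    ok = rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
    return ok, rate, channels, duration


def write_test_tone(path, seconds=0.5, freq=440.0):
    """Write a synthetic 16-bit sine tone as a stand-in for a recording."""
    n = int(EXPECTED_RATE * seconds)
    samples = (
        int(12000 * math.sin(2 * math.pi * freq * i / EXPECTED_RATE))
        for i in range(n)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(EXPECTED_CHANNELS)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(EXPECTED_RATE)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


demo_path = Path(tempfile.mkdtemp()) / "7_demo_0.wav"
write_test_tone(demo_path)
print(check_format(demo_path))
```

Pointing `check_format` at a directory of real files would flag anything that is not 8kHz mono.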
Key highlights
- 3,000 recordings: 50 per digit per speaker across 6 speakers
- Pre-built spectrograms and a trimmer.py utility for silence-hacking your own additions
- Direct loaders for PyTorch, TensorFlow, and the Hub ecosystem
- 50+ scholarly citations, plus wrappers in TensorFlow Datasets and Accord.NET
- CC BY-SA 4.0, with explicit contribution workflow for growing the corpus
Caveats
- English-only, six speakers — accent and demographic coverage is narrow
- 8kHz is telephone-grade; don’t expect rich phonetic detail
- The “Made with FSDD” list is self-reported and not curated
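The 8kHz caveat comes down to the Nyquist limit: a signal sampled at 8kHz can only represent frequencies up to 4kHz. The sketch below illustrates this with plain arithmetic, showing that a 5kHz tone sampled at 8kHz produces the same samples as a sign-flipped 3kHz tone. The function name and tone frequencies are chosen for illustration only. In practice, audio is normally low-pass filtered before downsampling, so content above 4kHz is discarded rather than aliased.

```python
import math

SAMPLE_RATE = 8000         # Hz
NYQUIST = SAMPLE_RATE / 2  # highest representable frequency, in Hz


def sample_tone(freq_hz, n_samples=16):
    """Sample a unit-amplitude sine tone at SAMPLE_RATE."""
    return [
        math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(n_samples)
    ]


high = sample_tone(5000)   # above the Nyquist limit
alias = sample_tone(3000)  # its mirror image below the limit

# The two sample sequences differ only in sign, so their sum is ~0.
max_diff = max(abs(h + a) for h, a in zip(high, alias))
print(NYQUIST, max_diff)
```

Sibilants such as the /s/ in "six" and "seven" carry much of their energy above 4kHz, which is why telephone-grade audio loses phonetic detail.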
Verdict
Grab this if you need a lightweight, reproducible spoken-digit baseline or a teaching dataset. Skip it if you’re building a production voice interface: the speaker count and sampling rate won’t generalize far.