A shopping list for speech data that doesn't cost your soul
A curated, license-sorted directory of open speech corpora for ASR, TTS, and voice research.

What it does
This repository is a hand-maintained spreadsheet-in-markdown of open speech datasets, organized by license type. Each entry lists language, hours, speaker count, download link, and the exact license — CC-0, CC-BY, CC-BY-SA, and beyond. It covers the usual suspects (LibriSpeech, Common Voice, VCTK) alongside obscure gems like Kʼicheʼ parliamentary recordings and isiXhosa speech corpora.
The interesting bit
The curation is opinionated: the maintainers prefer “truly open” corpora and flag when a dataset merely claims openness. The license-first organization is the quietly useful part — it saves you from downloading 500 hours of audio only to discover you can’t ship it commercially.
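As a rough illustration of why sorting by license first pays off, here is a minimal Python sketch that filters a catalog down to commercially usable corpora before anything is downloaded. The `CorpusEntry` fields, the helper names, and the sample rows are all hypothetical placeholders rather than data or tooling from the repository. The check relies only on the fact that Creative Commons identifiers containing the `NC` (NonCommercial) element prohibit commercial use, and it conservatively rejects anything that is not a recognizable CC identifier.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    # Hypothetical record shape; the real list is a markdown table.
    name: str
    language: str
    hours: float
    license: str  # e.g. "CC-0", "CC-BY", "CC-BY-SA", "CC-BY-NC 4.0"


def allows_commercial_use(license_id: str) -> bool:
    """True for a Creative Commons identifier without the NC element.

    Unknown (non-CC) identifiers return False so they get a manual review.
    """
    parts = re.split(r"[-\s_]+", license_id.strip().upper())
    return bool(parts[0]) and parts[0].startswith("CC") and "NC" not in parts


def shippable(entries, min_hours=0.0):
    """Keep entries that permit commercial use and meet a minimum size."""
    return [
        e for e in entries
        if allows_commercial_use(e.license) and e.hours >= min_hours
    ]


# Placeholder rows for illustration only; not real corpora or real figures.
SAMPLE = [
    CorpusEntry("placeholder-a", "xx", 500.0, "CC-BY-NC 4.0"),
    CorpusEntry("placeholder-b", "yy", 120.0, "CC-0"),
    CorpusEntry("placeholder-c", "zz", 40.0, "CC-BY-SA"),
]

for entry in shippable(SAMPLE):
    print(entry.name, entry.license)
```

With the placeholder rows above, the 500-hour `CC-BY-NC` entry is dropped before any download happens. Note that `CC-BY-SA` passes this filter but still carries share-alike obligations, so a stricter pipeline might track that separately.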
Key highlights
- CC-0 section includes Mozilla Common Voice (>15,000 validated hours) and Nordic language banks often overlooked in English-centric lists
- CC-BY section spans 11 South African languages via the NCHLT project, plus Icelandic parliamentary speech (542 hours)
- Maintained by Coqui, the open-source voice AI project, so the list reflects actual training needs rather than academic completeness
- Open issues track the backlog of datasets still to be added; PRs are explicitly welcomed
- Covers not just ASR and TTS but speech separation, emotion recognition, and voice cloning use cases
Caveats
- README admits “not all these corpora may meet” the free-and-open criteria; some entries are accessible but not fully open
- Long backlog of unadded corpora in Issues; coverage is incomplete
- No quality ratings or preprocessing notes — you still need to kick the tires yourself
Verdict
Essential bookmark if you’re training or fine-tuning voice models and tired of hunting license terms. Skip it if you need a unified download API or preprocessed tensors — this is pure directory, not infrastructure.