coqui-ai/open-speech-corpora

A shopping list for speech data that doesn't cost your soul

A curated, license-sorted directory of open speech corpora for ASR, TTS, and voice research.

What it does

This repository is a hand-maintained spreadsheet-in-markdown of open speech datasets, organized by license type. Each entry lists the language, hours, speaker count, download link, and exact license: CC-0, CC-BY, CC-BY-SA, and beyond. It covers the usual suspects (LibriSpeech, Common Voice, VCTK) alongside obscure gems like Kʼicheʼ parliamentary recordings and isiXhosa speech corpora.

The interesting bit

The curation is opinionated: the maintainers prefer “truly open” corpora and flag when a dataset merely claims openness. The license-first organization is the quietly useful part, because it saves you from downloading 500 hours of audio only to discover you can’t ship it commercially.
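To make the license-first idea concrete, here is a minimal Python sketch of filtering a corpus list by license before downloading anything. It is not part of the repository: the `CorpusEntry` class, its field names, the `commercially_usable` helper, and the `example.org` URLs are all hypothetical, and the third entry is an invented placeholder. The hours for the first two entries come from the highlights on this page; speaker counts are left unset because they are not given here.

```python
from dataclasses import dataclass
from typing import Optional

# Creative Commons licenses that permit commercial use.
# CC-BY-SA is included, but it carries a share-alike obligation.
COMMERCIAL_OK = {"CC-0", "CC-BY", "CC-BY-SA"}


@dataclass(frozen=True)
class CorpusEntry:
    """Hypothetical record mirroring the columns the directory lists."""
    name: str
    language: str
    hours: Optional[float]     # None when unknown
    speakers: Optional[int]    # None when unknown
    url: str                   # placeholder URLs, not real download links
    license: str


def commercially_usable(entries):
    """Keep only entries whose license allows commercial use."""
    return [e for e in entries if e.license.upper() in COMMERCIAL_OK]


entries = [
    # 15000 is a lower bound: the page reports ">15,000 validated hours".
    CorpusEntry("Mozilla Common Voice", "multilingual", 15000, None,
                "https://example.org/common-voice", "CC-0"),
    CorpusEntry("Icelandic parliamentary speech", "Icelandic", 542, None,
                "https://example.org/icelandic-parliament", "CC-BY"),
    # Invented entry, included only to show the filter rejecting something.
    CorpusEntry("Hypothetical NC corpus", "n/a", None, None,
                "https://example.org/nc-corpus", "CC-BY-NC"),
]

usable = commercially_usable(entries)
total_hours = sum(e.hours or 0 for e in usable)

for e in usable:
    print(f"{e.name}: {e.license}, {e.hours} h")
print(f"Total commercially usable hours (lower bound): {total_hours}")
```

In practice the license field in the directory is only a starting point; the caveats below still apply, so the license text of each corpus should be checked before shipping.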

Key highlights

  • CC-0 section includes Mozilla Common Voice (>15,000 validated hours) and Nordic language banks often overlooked in English-centric lists
  • CC-BY section spans 11 South African languages via the NCHLT project, plus Icelandic parliamentary speech (542 hours)
  • Maintained by Coqui, the open-source voice AI project, so the list reflects actual training needs rather than academic completeness
  • Open issues track a backlog of datasets to add; PRs are explicitly welcomed
  • Covers not just ASR and TTS but speech separation, emotion recognition, and voice cloning use cases

Caveats

  • README admits “not all these corpora may meet” the free-and-open criteria; some entries are accessible but not fully open
  • Long backlog of unadded corpora in Issues; coverage is incomplete
  • No quality ratings or preprocessing notes, so you still need to kick the tires yourself

Verdict

Essential bookmark if you’re training or fine-tuning voice models and tired of hunting for license terms. Skip it if you need a unified download API or preprocessed tensors: this is a pure directory, not infrastructure.
