Is open-speech-corpora open source?

Yes — coqui-ai/open-speech-corpora is open source, released under the MIT license.

How popular is open-speech-corpora?

coqui-ai/open-speech-corpora has 1.4k stars on GitHub.

Where can I find open-speech-corpora?

coqui-ai/open-speech-corpora is on GitHub at https://github.com/coqui-ai/open-speech-corpora.

coqui-ai/open-speech-corpora

A shopping list for speech data that doesn't cost your soul

A curated, license-sorted directory of open speech corpora for ASR, TTS, and voice research.

★ 1.4k stars · Data Tooling · Language Models

What it does

This repository is a hand-maintained directory of open speech datasets, effectively a spreadsheet in markdown, organized by license type. Each entry lists the language, hours of audio, speaker count, download link, and the exact license (CC-0, CC-BY, CC-BY-SA, and others). It covers the usual suspects (LibriSpeech, Common Voice, VCTK) alongside lesser-known corpora such as Kʼicheʼ parliamentary recordings and isiXhosa speech data.
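To make the entry layout concrete, here is a minimal sketch of reading such a listing into records. It assumes a pipe-delimited markdown table with the columns Name, Language, Hours, Speakers, License, and URL; that column set, the `CorpusEntry` class, and the sample rows are illustrative assumptions, not the repository's actual format or data.

```python
from dataclasses import dataclass


@dataclass
class CorpusEntry:
    # Hypothetical record mirroring the fields the directory is said to list.
    name: str
    language: str
    hours: float
    speakers: int
    license: str
    url: str


def parse_markdown_table(text: str) -> list[CorpusEntry]:
    """Parse a pipe-delimited markdown table into CorpusEntry records."""
    entries = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row and the |---|---| separator row.
        if cells[0] == "Name" or set(cells[0]) <= set("-: "):
            continue
        name, language, hours, speakers, license_, url = cells
        entries.append(
            CorpusEntry(name, language, float(hours), int(speakers), license_, url)
        )
    return entries


# Placeholder rows for illustration only; not real catalogue data.
SAMPLE = """
| Name | Language | Hours | Speakers | License | URL |
|------|----------|-------|----------|---------|-----|
| Placeholder A | Icelandic | 10.5 | 12 | CC-BY | https://example.org/a |
| Placeholder B | isiXhosa | 3 | 4 | CC-0 | https://example.org/b |
"""

entries = parse_markdown_table(SAMPLE)
for entry in entries:
    print(entry.name, entry.language, entry.hours, entry.license)
```

The real README may use a different layout (nested lists, extra columns, free-text hours), so the parser would need adjusting before use.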

The interesting bit

The curation is opinionated: the maintainers prefer “truly open” corpora and flag datasets that merely claim openness. The license-first organization is the quietly useful part, because it saves you from downloading 500 hours of audio only to discover you can’t ship it commercially.
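The license-first idea can be sketched as a filter applied before any download. The heuristic below only understands Creative Commons short names and treats any license carrying the NC (NonCommercial) element as unusable for commercial work; the function name and the catalogue rows are hypothetical, and a real project should read each corpus's actual license text rather than rely on this check.

```python
import re


def allows_commercial_use(license_name: str) -> bool:
    """Rough heuristic for Creative Commons short names only.

    CC-0, CC-BY, and CC-BY-SA permit commercial use; any variant
    containing the NC element does not. Anything that is not a CC
    name is conservatively rejected.
    """
    tokens = re.split(r"[\s-]+", license_name.strip().upper())
    return tokens[0].startswith("CC") and "NC" not in tokens


# Placeholder catalogue rows: (name, license, hours). Not real data.
catalogue = [
    ("Placeholder A", "CC-0", 120.0),
    ("Placeholder B", "CC-BY-NC", 500.0),
    ("Placeholder C", "CC-BY-SA", 30.0),
]

usable = [row for row in catalogue if allows_commercial_use(row[1])]
total_hours = sum(hours for _, _, hours in usable)

print([name for name, _, _ in usable])
print(total_hours)
```

Note that CC-BY-SA passes this filter but still carries share-alike obligations, which may matter for how derived models or datasets are distributed.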

Key highlights

CC-0 section includes Mozilla Common Voice (>15,000 validated hours) and Nordic language banks often overlooked in English-centric lists
CC-BY section spans 11 South African languages via the NCHLT project, plus Icelandic parliamentary speech (542 hours)
Maintained by Coqui, the open-source voice AI project, so the list reflects actual training needs rather than academic completeness
Open issues track a backlog of datasets still to be added; PRs are explicitly welcomed
Covers not just ASR and TTS but speech separation, emotion recognition, and voice cloning use cases

Caveats

README admits “not all these corpora may meet” the free-and-open criteria; some entries are accessible but not fully open
The Issues tab holds a long backlog of corpora not yet added, so coverage is incomplete
No quality ratings or preprocessing notes — you still need to kick the tires yourself

Verdict

Essential bookmark if you’re training or fine-tuning voice models and tired of hunting for license terms. Skip it if you need a unified download API or preprocessed tensors: this is a pure directory, not infrastructure.

Frequently asked

What is coqui-ai/open-speech-corpora?: A curated, license-sorted directory of open speech corpora for ASR, TTS, and voice research.