Is nlp-datasets open source?

Yes — niderhoff/nlp-datasets is an open-source project tracked on heatdrop.

How popular is nlp-datasets?

niderhoff/nlp-datasets has 6k stars on GitHub.

Where can I find nlp-datasets?

niderhoff/nlp-datasets is on GitHub at https://github.com/niderhoff/nlp-datasets.

niderhoff/nlp-datasets

A field guide to the internet's text dumps

An alphabetical index of free and public-domain text datasets, because finding training data shouldn't require a research librarian.

★6k stars · Data Tooling · Language Models

What it does

This repository is essentially a very thorough README: an alphabetical index of free and public-domain text datasets for NLP. Each entry links to the source, notes the size, and gives a one-sentence description so you can find training data without opening twenty tabs. The author is upfront that most of it is raw, unstructured text; if you need annotated corpora or treebanks, the bottom of the page sends you elsewhere.

The interesting bit

The list's value is breadth, not depth. It ranges from the 541 TB Common Crawl down to 200 KB of SMS spam, covering everything from Enron emails and Hillary Clinton's heavily redacted inbox to South Park scripts and Texas death-row last words. That eclectic range likely explains its almost 6,000 stars: it collects the internet's scattered text archives in one place.

Key highlights

Scale spans kilobytes to terabytes: Common Crawl (541 TB), ArXiv (270 GB of papers plus 190 GB of source files), down to 200 KB of SMS spam
Sources include Kaggle competitions, AWS Open Data, academic hosts, Reddit dumps, and government records
Explicitly limited to raw unstructured text; annotated corpora are deferred to separate references
Niche inclusions: Jeopardy questions, Diplomacy game messages annotated for truthfulness, and material safety datasheets
Access friction on a few entries: the Reuters Corpus requires a signed agreement sent by post, and some academic sets are available only on request
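
To make the size range above concrete, here is a minimal Python sketch that converts the quoted sizes to bytes and sorts them. The helper `size_to_bytes`, the `UNITS` table, and the shortened entry labels are hypothetical names for illustration, and decimal units are an assumption, since the index does not state whether its sizes are decimal or binary.

```python
import re

# Decimal multipliers. This is an assumption: the index does not say
# whether its sizes are decimal (10^3) or binary (2^10).
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def size_to_bytes(text: str) -> int:
    """Convert a size string such as '541 TB' to an approximate byte count."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]B)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.upper()])


# Sizes quoted in the highlights above. The keys are shortened labels,
# not official dataset identifiers.
entries = {
    "Common Crawl": "541 TB",
    "ArXiv papers": "270 GB",
    "ArXiv source files": "190 GB",
    "SMS spam": "200 KB",
}

# Order the labels from smallest to largest dataset.
ranked = sorted(entries, key=lambda name: size_to_bytes(entries[name]))

if __name__ == "__main__":
    for name in ranked:
        print(f"{name}: {size_to_bytes(entries[name]):,} bytes")
```

Normalizing sizes this way is mainly useful for a quick filter, for example keeping only entries that fit on a local disk before following any of the external links.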

Caveats

This is a curated list, not a data mirror; every link sends you to an external host
The README warns that most entries are raw, unstructured text, so plan to do your own cleaning and labeling
A few datasets require extra steps: Reuters needs a postal agreement, and some corpora are available on request

Verdict

Bookmark this if you are building language models and are tired of rediscovering dataset URLs. Skip it if you need pre-annotated corpora or a unified download interface.
