niderhoff/nlp-datasets
An alphabetically organized collection of public domain and freely available text datasets for NLP research and model training.

Velocity (7d): +1.6 ★/day · Trend: steady
This repository provides an alphabetical list of freely available text datasets for Natural Language Processing tasks. Its sources include web archives, blog posts, product reviews, academic papers, email corpora, and conversational data, in English and several other languages. The collection serves as a reference for researchers and developers looking for training data for language models and other NLP applications.