A junk drawer of labeled entity pairs, curated with care
Someone finally collected all the scattered NLP relation-extraction datasets into one repo so you don't have to hunt through decade-old conference websites.

What it does
This repository gathers 20+ publicly available datasets for training supervised models to extract semantic relationships between entities or nominals. It covers English and Portuguese, spans 2005–2020, and sorts everything into three buckets: traditional closed-class relation extraction, open information extraction (untyped relations), and distantly supervised data.
The interesting bit
The curation is the product. Each dataset entry lists its original citation, year, language, and class count in tidy tables, so there is no more digging through ACL Anthology PDFs to figure out what SemEval 2010 Task 8 actually contains. The author also separates three annotation regimes that papers often conflate: manually labeled closed-class data, open-class extractions, and silver-standard data produced by distant supervision.
Key highlights
- 13 traditional IE datasets, including classics like SemEval 2007/2010, AImed (protein interactions), and Wikipedia person-to-person relations with 53 labels
- 4 open IE datasets: ReVerb, ClausIE, and two IJCNLP/EMNLP sets
- 4 distantly supervised sets, including Google’s 2013 relation extraction corpus and a 2020 hybrid distant-supervision-plus-crowdsourcing corpus for phenotype-gene relations
- Portuguese coverage: ReRelEM (4 relation types) and DBpediaRelations-PT (10 types, manually revised after distant supervision)
- Every dataset is either hosted in the repo or linked externally, and each comes with its original paper citation
Caveats
- README descriptions vary in depth; some datasets get paragraphs, others get a sentence
- No code, no loaders, no unified format—this is purely a data catalog with downloads
- A few entries are external links rather than hosted files (e.g., Riedel’s 2010 ECML data, Google’s corpus)
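Since there are no loaders, expect to write a small parser per dataset. As a rough illustration of how little that can take, here is a minimal Python sketch for one plausible layout: an id, a tab, a quoted sentence with `<e1>`/`<e2>` tags, then a label line, with records separated by blank lines. That layout is an assumption modeled on how SemEval 2010 Task 8 is commonly distributed, not something the repo documents, and the names `RelationExample` and `parse_tagged_examples` are hypothetical. Check the actual files before relying on it.

```python
import re
from typing import Iterator, NamedTuple


class RelationExample(NamedTuple):
    """One labeled entity pair (hypothetical record type for illustration)."""
    sentence: str
    e1: str
    e2: str
    label: str


_E1 = re.compile(r"<e1>(.*?)</e1>")
_E2 = re.compile(r"<e2>(.*?)</e2>")
_TAGS = re.compile(r"</?e[12]>")


def parse_tagged_examples(text: str) -> Iterator[RelationExample]:
    """Yield examples from text in the ASSUMED layout described above."""
    # Records are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue  # need at least a sentence line and a label line
        # First line is assumed to be: <id>\t"<tagged sentence>"
        _, _, raw = lines[0].partition("\t")
        raw = raw.strip().strip('"')
        m1, m2 = _E1.search(raw), _E2.search(raw)
        if not (m1 and m2):
            continue  # skip records that lack both entity tags
        yield RelationExample(
            sentence=_TAGS.sub("", raw),
            e1=m1.group(1),
            e2=m2.group(1),
            label=lines[1],  # second line is assumed to hold the relation label
        )


# Made-up sample data in the assumed layout, not taken from any dataset.
SAMPLE = (
    '1\t"The <e1>author</e1> wrote a <e2>book</e2>."\n'
    "Product-Producer(e2,e1)\n"
    "Comment:\n"
    "\n"
    '2\t"No tags here."\n'
    "Other\n"
    "Comment:\n"
)

if __name__ == "__main__":
    for example in parse_tagged_examples(SAMPLE):
        print(example)
```

Other datasets in the catalog will need their own parsers, but mapping each one onto a shared record type like this is one way to get the unified format the repo does not provide.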
Verdict
Worth bookmarking if you’re building or benchmarking relation extractors and need to know which dataset fits your language, domain, and supervision setup. Skip it if you want preprocessed tensors or a training framework: this is just the raw material, well organized.