davidsbatista/Annotated-Semantic-Relationships-Datasets

A junk drawer of labeled entity pairs, curated with care

Someone finally collected all the scattered NLP relation-extraction datasets into one repo so you don't have to hunt through decade-old conference websites.

707 stars · Data Tooling · Velocity (7d): +0.2 ★/day · Trend: steady

What it does

This repository gathers 20+ publicly available datasets for training supervised models to extract semantic relationships between entities or nominals. It covers English and Portuguese, spans 2005–2020, and sorts everything into three buckets: traditional closed-class relation extraction, open information extraction (untyped relations), and distantly supervised data.

The interesting bit

The curation is the product. Each entry lists the dataset's original citation, year, language, and class count in tidy tables, so there is no more digging through ACL Anthology PDFs to figure out what SemEval 2010 Task 8 actually contains. The author also distinguishes annotation regimes that papers often conflate: manually labeled, open-class, and silver-standard distant supervision.

Key highlights

  • 13 traditional IE datasets, including classics like SemEval 2007/2010, AImed (protein interactions), and Wikipedia person-to-person relations with 53 labels
  • 4 open IE datasets: ReVerb, ClausIE, and two IJCNLP/EMNLP sets
  • 4 distantly supervised sets, including Google’s 2013 relation extraction corpus and a 2020 hybrid distant-supervision-plus-crowdsourcing corpus for phenotype-gene relations
  • Portuguese coverage: ReRelEM (4 relation types) and DBpediaRelations-PT (10 types, manually revised after distant supervision)
  • All datasets hosted directly or linked with original paper citations

Caveats

  • README descriptions vary in depth; some datasets get paragraphs, others get a sentence
  • No code, no loaders, no unified format: this is purely a data catalog with downloads
  • A few entries are external links rather than hosted files (e.g., Riedel’s 2010 ECML data, Google’s corpus)
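Since the catalog ships no loaders, expect to write a small parser per dataset. As a rough illustration, here is a minimal sketch for the SemEval 2010 Task 8 training layout as it is commonly distributed: a numbered, quoted sentence with `<e1>`/`<e2>` tags, then a relation label line, then a `Comment:` line, with records separated by blank lines. The `Example` type, the `parse_semeval2010` function, and the two sample sentences are hypothetical and not taken from the repository; check the layout against the actual file before relying on it.

```python
import re
from typing import Iterator, NamedTuple


class Example(NamedTuple):
    """One parsed record (hypothetical container, not part of the repo)."""
    sent_id: int
    sentence: str   # sentence text with the entity tags stripped
    e1: str
    e2: str
    relation: str   # e.g. "Component-Whole(e1,e2)" or "Other"


_HEADER = re.compile(r"(\d+)\s+(.*)")      # "<id><TAB>\"sentence\""
_E1 = re.compile(r"<e1>(.*?)</e1>")
_E2 = re.compile(r"<e2>(.*?)</e2>")
_TAGS = re.compile(r"</?e[12]>")


def parse_semeval2010(text: str) -> Iterator[Example]:
    """Yield examples from text in the assumed SemEval 2010 Task 8 layout."""
    # Records are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue  # need at least a sentence line and a label line
        header = _HEADER.match(lines[0])
        if header is None:
            continue
        sentence = header.group(2)
        # Drop the surrounding double quotes if both are present.
        if len(sentence) >= 2 and sentence[0] == '"' and sentence[-1] == '"':
            sentence = sentence[1:-1]
        e1, e2 = _E1.search(sentence), _E2.search(sentence)
        if e1 is None or e2 is None:
            continue  # skip records without both entity tags
        yield Example(
            sent_id=int(header.group(1)),
            sentence=_TAGS.sub("", sentence),
            e1=e1.group(1),
            e2=e2.group(1),
            relation=lines[1],
        )


# Made-up sample records in the assumed layout (not real dataset rows).
SAMPLE = (
    '1\t"The <e1>engine</e1> of the <e2>car</e2> stalled."\n'
    "Component-Whole(e1,e2)\n"
    "Comment: illustrative only\n"
    "\n"
    '2\t"The <e1>child</e1> looked at the <e2>river</e2>."\n'
    "Other\n"
    "Comment:\n"
)

if __name__ == "__main__":
    for ex in parse_semeval2010(SAMPLE):
        print(ex.sent_id, ex.e1, ex.e2, ex.relation)
```

Other datasets in the catalog use their own layouts, so a parser like this would need to be rewritten per source rather than reused.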

Verdict

Worth bookmarking if you’re building or benchmarking relation extractors and need to know which dataset fits your language, domain, and supervision setup. Skip it if you want preprocessed tensors or a training framework: this is just the raw material, well-organized.
