juand-r/entity-recognition-datasets

A phone book for machines: NER datasets, one repo

Curated links to named-entity recognition corpora across languages, domains, and increasingly weird licenses.

1.6k stars · Python · Data Tooling · Learning
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

This repository is a curated index of annotated datasets for named entity recognition (NER) and related extraction tasks. It spans news, medical records, Twitter, finance, malware reports, astrophysics papers, and parliamentary debates across dozens of languages. Some datasets are included directly; many more are documented with download pointers and license notes, plus conversion scripts to CoNLL 2003 format where licensing prevents redistribution.
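For readers who have not worked with the CoNLL 2003 layout before, here is a minimal sketch of reading it in Python. It assumes the usual convention of one token per line, whitespace-separated columns with the entity tag last, blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. The function name `read_conll` and the sample text are illustrative and are not taken from the repository's own scripts.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, entity_tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-2003-style lines into sentences of (token, tag) pairs.

    Assumes the first column is the token and the last column is the
    entity tag; any middle columns (POS, chunk) are ignored.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the four-column layout (token, POS, chunk, tag).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
```

Keeping only the first and last columns is a deliberate simplification: datasets converted to this layout do not always carry POS or chunk columns, so a reader that depends on a fixed column count is more brittle.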

The interesting bit

The real value is the license archaeology. The maintainer has done the tedious work of tracking down whether you can legally use OntoNotes (LDC), re3d (six different licenses in one bundle), or that German legal document corpus buried in an LREC preprint. For researchers tired of starting every project with a week of rights clearance, this is a significant head start.

Key highlights

  • English coverage runs from classic CoNLL 2003 to niche domains: robotics assembly instructions, SEC filings, music NER, and astrophysics (WIESP2022)
  • Multilingual scope includes code-switching corpora (Spanish-English tweets, Hindi-English social media), historical documents, and legal text in German and Dutch
  • Includes conversion utilities for restricted datasets that can’t be redistributed directly
  • Some datasets bundled directly: WikiGold, WNUT17, AnEM, re3d, SEC-filings, BTC, GUM 3.1.0
  • 1,573 stars suggest it fills a genuine gap in the NLP tooling landscape

Caveats

  • Maintainer noted in 2020 they are “no longer actively adding datasets”; newer resources (post-2020) are likely missing unless contributed via PR
  • Several entries are just links with no local code or validation, so you’ll still need to wrangle format inconsistencies yourself
  • “Assembly” dataset listed with “X” for both license and availability; unclear whether the dataset is unavailable or merely undocumented
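As one illustration of the format wrangling mentioned in the caveats, NER corpora often differ in tagging scheme: some use IOB1, where `B-` appears only when an entity directly follows another of the same type, while others use IOB2, where every entity starts with `B-`. The sketch below normalizes IOB1 tags to IOB2. Which scheme any particular dataset in the index uses is not stated here, so treat this as an assumed example of the problem, not a description of the repository's tooling; `iob1_to_iob2` is a hypothetical helper name.

```python
from typing import List, Optional


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Rewrite IOB1 tags so that every entity begins with a B- tag."""
    out: List[str] = []
    prev_type: Optional[str] = None  # entity type of the previous token, if any
    for tag in tags:
        if tag == "O":
            out.append(tag)
            prev_type = None
            continue
        prefix, _, etype = tag.partition("-")
        # An I- tag that does not continue an entity of the same type
        # is really the start of a new entity.
        if prefix == "I" and prev_type != etype:
            out.append("B-" + etype)
        else:
            out.append(tag)
        prev_type = etype
    return out


converted = iob1_to_iob2(["I-ORG", "O", "I-PER", "I-PER", "B-PER", "I-LOC"])
```

Running a normalization pass like this before merging corpora keeps entity-level evaluation from miscounting spans that are tagged identically in meaning but differently in form.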

Verdict

Worth bookmarking if you’re building or benchmarking NER models and need to survey what’s out there without drowning in LDC catalog pages. Less useful if you want a unified download-and-train pipeline; this is a map, not a framework.
