A phone book for machines: an index of NER datasets in one repo
Curated links to named-entity recognition corpora across languages, domains, and increasingly weird licenses.

What it does
This repository is a curated index of annotated datasets for named entity recognition (NER) and related extraction tasks. It spans news, medical records, Twitter, finance, malware reports, astrophysics papers, and parliamentary debates across dozens of languages. Some datasets are included directly; many more are documented with download pointers and license notes, plus conversion scripts to CoNLL 2003 format where licensing prevents redistribution.
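To make the conversion step concrete, here is a minimal sketch of what turning span annotations into CoNLL-style lines can look like. It is not the repository's own script: the function names `spans_to_bio` and `to_conll_lines` and the `(start, end, label)` span layout are assumptions made for illustration. It also simplifies the target format. The original CoNLL 2003 files carry four columns (token, POS tag, chunk tag, NER tag) and use IOB1 tagging, while this sketch writes only token and NER tag in the more common BIO (IOB2) style.

```python
from typing import List, Tuple

# Hypothetical span layout: (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn token-level entity spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def to_conll_lines(tokens: List[str], spans: List[Span]) -> str:
    """Render one sentence as 'token tag' lines (two columns only)."""
    tags = spans_to_bio(len(tokens), spans)
    return "\n".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags)) + "\n"


if __name__ == "__main__":
    tokens = ["Angela", "Merkel", "visited", "Paris", "."]
    spans = [(0, 2, "PER"), (3, 4, "LOC")]
    print(to_conll_lines(tokens, spans), end="")
```

Sentences are conventionally separated by a blank line when several are written to one file. Real converters also have to handle tokenization mismatches and overlapping spans, which this sketch ignores.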
The interesting bit
The real value is the license archaeology. The maintainer has done the tedious work of tracking down whether you can legally use OntoNotes (LDC), re3d (six different licenses in one bundle), or that German legal document corpus buried in an LREC preprint. For researchers tired of starting every project with a week of rights clearance, this is a significant head start.
Key highlights
- English coverage runs from classic CoNLL 2003 to niche domains: robotics assembly instructions, SEC filings, music NER, and astrophysics (WIESP2022)
- Multilingual scope includes code-switching corpora (Spanish-English tweets, Hindi-English social media), historical documents, and legal text in German and Dutch
- Includes conversion utilities for restricted datasets that can’t be redistributed directly
- Some datasets bundled directly: WikiGold, WNUT17, AnEM, re3d, SEC-filings, BTC, GUM 3.1.0
- 1,573 stars suggest the index fills a genuine gap in the NLP tooling landscape
Caveats
- The maintainer noted in 2020 that they are “no longer actively adding datasets”; resources released since then are likely missing unless contributed via PR
- Several entries are just links with no local code or validation, so you’ll still need to wrangle format inconsistencies yourself
- “Assembly” dataset listed with “X” for both license and availability; unclear if this is unavailable or merely undocumented
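As an illustration of the format wrangling mentioned above, the sketch below reads CoNLL-style lines and flags tags that break strict BIO (IOB2) rules. The helper names `read_conll` and `bio_errors` are hypothetical, and the sketch assumes whitespace-separated columns with the token first and the NER tag last. One inconsistency it surfaces is the tagging scheme itself: the original CoNLL 2003 data uses IOB1, where an entity may begin with `I-`, so a file that fails this check may be perfectly valid IOB1 and not corrupt.

```python
from typing import Iterable, List, Tuple


def read_conll(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Group (token, tag) pairs into sentences; blank lines and
    -DOCSTART- markers are treated as sentence boundaries."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column token, last column tag
    if current:
        sentences.append(current)
    return sentences


def bio_errors(tags: List[str]) -> List[int]:
    """Indices where an I- tag does not continue an entity of the same type."""
    errors = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            errors.append(i)
        prev = tag
    return errors


if __name__ == "__main__":
    sample = [
        "-DOCSTART- -X- -X- O",
        "",
        "EU NNP B-NP B-ORG",
        "rejects VBZ B-VP O",
        "German JJ B-NP I-MISC",
    ]
    for sentence in read_conll(sample):
        print(bio_errors([tag for _, tag in sentence]))
```

Running a check like this on each downloaded corpus before training is a cheap way to find out which tagging scheme a file actually follows.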
Verdict
Worth bookmarking if you’re building or benchmarking NER models and need to survey what’s out there without drowning in LDC catalog pages. Less useful if you want a unified download-and-train pipeline; this is a map, not a framework.