juand-r/entity-recognition-datasets

A phone book for machines: NER datasets, one repo

Curated links to named-entity recognition corpora across languages, domains, and increasingly weird licenses.

★1.6k stars Python Data Tooling Learning

What it does

This repository is a curated index of annotated datasets for named entity recognition (NER) and related extraction tasks. It spans news, medical records, Twitter, finance, malware reports, astrophysics papers, and parliamentary debates across dozens of languages. Some datasets are included directly; many more are documented with download pointers and license notes, plus conversion scripts to CoNLL 2003 format where licensing prevents redistribution.
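
To make the target format concrete, here is a minimal sketch of a reader for CoNLL 2003-style files. It assumes the common layout: one token per line, whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and optional -DOCSTART- marker lines. The function name `read_conll` and the sample text are illustrative and are not taken from the repository's own scripts.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL 2003-style lines.

    Assumption: the token is the first column and the NER tag is the
    last column; any columns in between (POS, chunk) are ignored.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:  # the file may not end with a blank line
        yield sentence


# Illustrative sample, not copied from any dataset in the index.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = list(read_conll(sample))
```

Reading only the first and last columns keeps the sketch tolerant of files that drop the POS or chunk columns, which some converted corpora do.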

The interesting bit

The real value is the license archaeology. The maintainer has done the tedious work of tracking down whether you can legally use OntoNotes (LDC), re3d (six different licenses in one bundle), or that German legal document corpus buried in an LREC preprint. For researchers tired of starting every project with a week of rights clearance, this is a significant head start.

Key highlights

English coverage runs from classic CoNLL 2003 to niche domains: robotics assembly instructions, SEC filings, music NER, and astrophysics (WIESP2022)
Multilingual scope includes code-switching corpora (Spanish-English tweets, Hindi-English social media), historical documents, and legal text in German and Dutch
Includes conversion utilities for restricted datasets that can’t be redistributed directly
Some datasets bundled directly: WikiGold, WNUT17, AnEM, re3d, SEC-filings, BTC, GUM 3.1.0
Its 1,573 GitHub stars suggest it fills a genuine gap in the NLP tooling landscape

Caveats

The maintainer noted in 2020 that they are “no longer actively adding datasets”; resources released after 2020 are likely missing unless contributed via PR
Several entries are just links with no local code or validation — you’ll still need to wrangle format inconsistencies yourself
The “Assembly” dataset is listed with “X” for both license and availability; it is unclear whether the dataset is unavailable or merely undocumented
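
One format inconsistency that often comes up when combining NER corpora is the tagging scheme: some files use IOB1 (an entity starts with I-, and B- appears only when an entity directly follows another of the same type), while others use IOB2, where every entity starts with B-. The sketch below normalizes one sentence's tags from IOB1 to IOB2. It is an illustration only; the function name `iob1_to_iob2` is hypothetical, and which scheme any given dataset in the index uses is something you would need to check yourself.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Convert one sentence's tags from IOB1 to IOB2.

    Assumption: tags are "O" or have the form "B-TYPE" / "I-TYPE".
    """
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag continues an entity only if the previous tag
            # belongs to an entity of the same type; otherwise it opens one.
            continues = previous != "O" and previous[2:] == entity_type
            converted.append(tag if continues else "B-" + entity_type)
        else:
            converted.append(tag)  # "O" and B- tags are left unchanged
        previous = tag
    return converted


iob1 = ["I-ORG", "O", "I-PER", "I-PER", "B-PER", "I-LOC"]
iob2 = iob1_to_iob2(iob1)
```

Running the conversion per sentence matters: carrying `previous` across a sentence boundary could wrongly treat the first token of a sentence as the continuation of an entity from the sentence before.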

Verdict

Worth bookmarking if you’re building or benchmarking NER models and need to survey what’s out there without drowning in LDC catalog pages. Less useful if you want a unified download-and-train pipeline; this is a map, not a framework.

Frequently asked

What is juand-r/entity-recognition-datasets? Curated links to named-entity recognition corpora across languages, domains, and increasingly weird licenses.
Is entity-recognition-datasets open source? Yes, juand-r/entity-recognition-datasets is open source, released under the MIT license.
What language is entity-recognition-datasets written in? juand-r/entity-recognition-datasets is primarily written in Python.
How popular is entity-recognition-datasets? juand-r/entity-recognition-datasets has 1.6k stars on GitHub.
Where can I find entity-recognition-datasets? juand-r/entity-recognition-datasets is on GitHub at https://github.com/juand-r/entity-recognition-datasets.