t-davidson/hate-speech-and-offensive-language

A 2017 hate-speech dataset that sparked a field — and a reckoning

The ICWSM paper and Python 2.7 code that showed how easily "offensive" and "hate speech" get conflated, with follow-up work finding racial bias in the labels themselves.

842 stars · Jupyter Notebook · Data Tooling

What it does

This repository holds the original dataset, lexicon, and Jupyter notebooks from a 2017 ICWSM paper on automated hate speech detection. The authors labeled ~25K tweets as “hate speech,” “offensive language,” or “neither,” then built classifiers to separate the three classes. The data ships as CSV and Python 2.7 pickles, alongside the notebooks and a standalone classifier script for running on new data.

The interesting bit

The paper’s core argument — that “offensive” and “hate speech” are routinely muddled by annotators and models alike — turned out to be prescient. The authors later published follow-up work (2019) finding racial bias embedded in this very dataset, making the repo a case study in how early NLP benchmark datasets can inherit and amplify the prejudices of their annotators.

Key highlights

  • ~25K manually labeled tweets with three-way classification (hate speech / offensive / neither)
  • Custom lexicon generated to improve hate speech detection accuracy
  • Pre-built classifier pipeline with test case for running on new data
  • CSV and Python 2.7 pickle formats provided
  • Explicit content warnings throughout; authors track usage via a contact form
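
To get a feel for how the three classes are balanced, a label count over the CSV is a reasonable first step. The sketch below is a minimal illustration, not the authors' code: the column name `class` and the 0/1/2 label encoding are assumptions to check against the repository's README, and the sample rows are placeholders, not real data.

```python
import csv
import io
from collections import Counter

# Assumed label encoding; verify against the repository README before relying on it.
LABEL_NAMES = {"0": "hate speech", "1": "offensive language", "2": "neither"}


def label_distribution(csv_file, label_column="class"):
    """Count rows per label in an open CSV file object.

    `label_column` is an assumed column name, not confirmed by this page.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        raw = row[label_column].strip()
        # Unrecognized codes are kept visible instead of being silently dropped.
        counts[LABEL_NAMES.get(raw, "unknown (" + raw + ")")] += 1
    return counts


def label_shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Tiny in-memory stand-in with placeholder text, not rows from the dataset.
    sample = io.StringIO(
        "class,tweet\n1,placeholder a\n2,placeholder b\n1,placeholder c\n"
    )
    counts = label_distribution(sample)
    for name, share in label_shares(counts).items():
        print(name, counts[name], round(share, 2))
```

For the real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`, where `path` points at wherever the CSV lives in your checkout.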

Caveats

  • Repository is no longer maintained; author explicitly rejects issues and pull requests about Python/package compatibility
  • Code is Python 2.7, which reached end of life in January 2020
  • The 2019 follow-up paper identified racial bias in this dataset; the README links to it but does not integrate those findings into the original materials
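
If you only need to open the Python 2.7 pickles from a Python 3 interpreter, the standard library's `encoding` argument is usually the first thing to try. This is a minimal sketch under that assumption; `load_py2_pickle` is a hypothetical helper name, and pickled objects that depend on old library versions (for example, a fitted model) may still fail to load.

```python
import pickle


def load_py2_pickle(path):
    """Load a pickle written under Python 2 from Python 3.

    Hypothetical helper for illustration. Only unpickle files you trust:
    pickle can execute arbitrary code while loading.
    """
    with open(path, "rb") as f:
        # Python 2 `str` objects are raw bytes; "latin1" maps every byte value
        # to a code point, so decoding them cannot raise an error.
        return pickle.load(f, encoding="latin1")
```

Plain data such as lists, dicts, and strings generally loads this way. For anything else, reading the CSV copy of the data avoids the compatibility problem altogether.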

Verdict

Worth studying if you work on content moderation, dataset ethics, or the history of NLP bias research — less useful if you need production-ready tooling. Treat it as a time-capsule paper artifact, not a dependency.
