Is hate-speech-and-offensive-language open source?

Yes — t-davidson/hate-speech-and-offensive-language is open source, released under the MIT license.

What language is hate-speech-and-offensive-language written in?

GitHub classifies t-davidson/hate-speech-and-offensive-language as primarily Jupyter Notebook; the code itself is Python 2.7.

How popular is hate-speech-and-offensive-language?

t-davidson/hate-speech-and-offensive-language has 846 stars on GitHub.

Where can I find hate-speech-and-offensive-language?

t-davidson/hate-speech-and-offensive-language is on GitHub at https://github.com/t-davidson/hate-speech-and-offensive-language.

t-davidson/hate-speech-and-offensive-language

A 2017 hate-speech dataset that sparked a field — and a reckoning

The ICWSM paper and Python 2.7 code that showed how easily "offensive" and "hate speech" get conflated, with follow-up work finding racial bias in the labels themselves.

★ 846 stars · Jupyter Notebook · Data Tooling

What it does

This repository holds the original dataset, lexicon, and Jupyter notebooks from a 2017 ICWSM paper on automated hate speech detection. Roughly 25K tweets were manually labeled as “hate speech,” “offensive language,” or “neither,” and the authors then built classifiers to separate the three classes. The data is provided in CSV and Python 2.7 pickle formats, the analysis as notebooks, and a standalone classifier script is included for running on new data.

The interesting bit

The paper’s core argument — that “offensive” and “hate speech” are routinely muddled by annotators and models alike — turned out to be prescient. The authors later published follow-up work (2019) finding racial bias embedded in this very dataset, making the repo a case study in how early NLP benchmark datasets can inherit and amplify the prejudices of their annotators.

Key highlights

~25K manually labeled tweets with three-way classification (hate speech / offensive / neither)
Custom lexicon generated to improve hate speech detection accuracy
Pre-built classifier pipeline with test case for running on new data
CSV and Python 2.7 pickle formats provided
Explicit content warnings throughout; authors track usage via a contact form
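
As a rough illustration of working with the three-way labels in CSV form, the sketch below tallies how many rows fall into each class. It is not the authors' code: the column name `class`, the integer coding (0, 1, 2), and the helper `label_distribution` are assumptions made for this example, and the rows are synthetic placeholders rather than data from the repository. Check the README for the actual schema before adapting it.

```python
import csv
import io
from collections import Counter

# Assumed label coding; confirm against the repository README before relying on it.
LABEL_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}


def label_distribution(csv_file, label_column="class"):
    """Count rows per label in a CSV that has one integer label column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[LABEL_NAMES[int(row[label_column])]] += 1
    return counts


# Synthetic placeholder rows standing in for the real file.
sample = io.StringIO(
    "class,tweet\n"
    "2,placeholder text a\n"
    "1,placeholder text b\n"
    "1,placeholder text c\n"
    "0,placeholder text d\n"
)

distribution = label_distribution(sample)
for name, n in distribution.items():
    print(f"{name}: {n}")
```

To run this on a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` in place of `sample`. Looking at the class balance first is a reasonable habit with this kind of data, since a heavily skewed split between “offensive” and “hate speech” affects how any classifier result should be read.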

Caveats

Repository is no longer maintained; author explicitly rejects issues and pull requests about Python/package compatibility
Code is Python 2.7, which reached end of life in January 2020; running it today means porting to Python 3 or recreating a legacy environment
The 2019 follow-up paper identified racial bias in this dataset; the README links to it but does not integrate those findings into the original materials
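
Because the pickles were written under Python 2.7, reading them from Python 3 usually needs an explicit `encoding` argument. The sketch below shows the general technique on a hand-written Python 2 style pickle; the byte string `PY2_PICKLE` and the helper `load_py2_pickle` are hypothetical stand-ins, not files or functions from the repository, and whether `latin1` is the right choice for the repository's own pickles is an assumption to verify.

```python
import pickle

# Hand-written protocol-0 pickle of the Python 2 byte string 'caf\xe9'.
# It stands in for a legacy file and is not taken from the repository.
PY2_PICKLE = b"S'caf\\xe9'\n."


def load_py2_pickle(data):
    """Unpickle Python 2 bytes, decoding 8-bit str objects as Latin-1."""
    return pickle.loads(data, encoding="latin1")


# The default encoding ("ASCII") rejects the non-ASCII byte.
try:
    pickle.loads(PY2_PICKLE)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

text = load_py2_pickle(PY2_PICKLE)
print(default_failed, repr(text))
```

For a file on disk the equivalent call is `pickle.load(f, encoding="latin1")` with `f` opened in binary mode. This only addresses string decoding: pickled objects that depend on old library versions may still fail to load, and pickle files should only be opened when they come from a trusted source. Where a CSV copy of the same data exists, it is likely the simpler route.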

Verdict

Worth studying if you work on content moderation, dataset ethics, or the history of NLP bias research — less useful if you need production-ready tooling. Treat it as a time-capsule paper artifact, not a dependency.
