Is hh-rlhf open source?

Yes — anthropics/hh-rlhf is open source, released under the MIT license.

How popular is hh-rlhf?

anthropics/hh-rlhf has 1.8k stars on GitHub.

Where can I find hh-rlhf?

anthropics/hh-rlhf is on GitHub at https://github.com/anthropics/hh-rlhf.

anthropics/hh-rlhf

Anthropic’s RLHF dataset: preference pairs and red-team transcripts

Raw preference pairs and adversarial transcripts from Anthropic’s early RLHF alignment experiments.

★1.8k stars Data Tooling Learning

What it does

This repository distributes two JSONL datasets from Anthropic’s early alignment research: human preference pairs covering helpfulness and harmlessness, and red-team transcripts of adversarial conversations with AI assistants. Each preference record contrasts a “chosen” response with a “rejected” one, while the red-teaming logs include crowdworker ratings, model parameter counts, and automated harmlessness scores. Anthropic has since deprecated the GitHub mirror in favor of an identical Hugging Face distribution.
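
A minimal sketch of reading preference records, assuming each JSONL line is a JSON object with string fields named "chosen" and "rejected" as described above. The sample records and the names PreferencePair and parse_preference_lines are made up for illustration and do not come from the repository.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class PreferencePair(NamedTuple):
    """One preference record (hypothetical helper type)."""
    chosen: str
    rejected: str


def parse_preference_lines(lines: Iterable[str]) -> Iterator[PreferencePair]:
    """Yield one PreferencePair per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield PreferencePair(record["chosen"], record["rejected"])


# Made-up sample records, not taken from the dataset.
sample_jsonl = "\n".join([
    json.dumps({"chosen": "Sure, here is a recipe.", "rejected": "I don't know."}),
    json.dumps({"chosen": "I can't help with that.", "rejected": "Here is how."}),
])

pairs = list(parse_preference_lines(sample_jsonl.splitlines()))
print(len(pairs), "pairs parsed")
```

The same generator can be fed an open file handle instead of a list of strings, since it only needs an iterable of lines.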

The interesting bit

The red-teaming dataset is unusually detailed for a public release: each attack is tagged with the adversary’s self-reported success rating, the platform the adversary came from (Upwork or MTurk), and a preference-model-derived harmlessness score for both the task description and the full transcript. It reads like a public incident-response log for a 52B-parameter model’s worst moments.

Key highlights

Helpfulness preference data spans three training tranches (base model, rejection-sampled, and online iterated), while harmlessness data was collected only for the base models.
Red-team records include transcript, num_params, model_type, rating, task_description, and per-sample harmlessness scores.
Tags describing attack strategies were crowdsourced post hoc, but only for a random sample of 1,000 attempts across two of the four model types.
Anthropic explicitly warns that the data contains offensive, violent, and discriminatory content intended solely for research that reduces model harms.
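
As a rough illustration of working with the red-team fields listed above, the sketch below averages the self-reported rating per model_type. The records are invented, the model_type values "type_a" and "type_b" are placeholders, and the assumption that a higher rating means a more successful attack is not confirmed by this page.

```python
from collections import defaultdict
from statistics import mean

# Invented records that reuse the field names listed above.
# All values are placeholders, not real dataset entries.
records = [
    {"transcript": "...", "num_params": "52B", "model_type": "type_a",
     "rating": 4, "task_description": "..."},
    {"transcript": "...", "num_params": "52B", "model_type": "type_a",
     "rating": 2, "task_description": "..."},
    {"transcript": "...", "num_params": "52B", "model_type": "type_b",
     "rating": 0, "task_description": "..."},
]


def mean_rating_by_model_type(rows):
    """Group ratings by model_type and return the mean rating per group."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["model_type"]].append(row["rating"])
    return {model_type: mean(values) for model_type, values in ratings.items()}


summary = mean_rating_by_model_type(records)
print(summary)
```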

Caveats

The GitHub repository is deprecated; the canonical source is now the Hugging Face Anthropic/hh-rlhf dataset.
Red-team tags are sparse, covering only a random sample of 1,000 attempts for two of the four model types used.
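
Because most red-team records carry no tags, it can help to measure tag coverage before relying on them. The sketch below assumes a field named "tags" that is missing, null, or empty on untagged records; that field name and representation are assumptions, and the sample records are invented.

```python
def tag_coverage(rows):
    """Return the fraction of records that carry at least one tag."""
    if not rows:
        return 0.0
    # A missing key, None, and an empty list all count as untagged.
    tagged = sum(1 for row in rows if row.get("tags"))
    return tagged / len(rows)


# Invented sample: one tagged record and three untagged variants.
sample = [
    {"rating": 3, "tags": ["hypothetical tag"]},
    {"rating": 1, "tags": None},
    {"rating": 0},
    {"rating": 4, "tags": []},
]

coverage = tag_coverage(sample)
print(f"{coverage:.0%} of sample records are tagged")
```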

Verdict

Researchers building or benchmarking RLHF pipelines and AI safety tooling should use the Hugging Face version; casual developers looking for a quick alignment drop-in will find only raw data here.
