anthropics/hh-rlhf
A dataset of human preference comparisons and red-teaming data used to train AI assistants via RLHF.

This repository provides two datasets from Anthropic for training AI systems. The human preference dataset contains chosen/rejected response pairs annotated for helpfulness and harmlessness, intended for training via Reinforcement Learning from Human Feedback (RLHF). The red-teaming dataset contains adversarial inputs designed to probe model vulnerabilities. Both datasets support research aimed at reducing AI harm and improving model behavior. The repository is now deprecated in favor of the version hosted on Hugging Face.
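To make the chosen/rejected structure concrete, here is a minimal Python sketch of how such pairs might be parsed and scored. It rests on assumptions not stated above: that records are stored as JSON Lines, and that each record has `chosen` and `rejected` text fields. The sample records are made up for illustration and are not taken from the dataset. The pairwise logistic loss shown is one common objective for training a reward model on preference pairs; the description above does not say which objective is used with this data.

```python
import json
import math
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class PreferencePair:
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator did not prefer


def parse_pairs(lines: Iterable[str]) -> Iterator[PreferencePair]:
    """Parse JSON Lines text into preference pairs, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # ASSUMPTION: the field names "chosen" and "rejected" are illustrative.
        yield PreferencePair(chosen=record["chosen"], rejected=record["rejected"])


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Written in a numerically stable form so large reward gaps do not overflow.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up sample records (hypothetical, not real dataset content).
SAMPLE = [
    '{"chosen": "Human: How do I boil an egg? Assistant: Simmer it for about ten minutes.",'
    ' "rejected": "Human: How do I boil an egg? Assistant: I am not sure."}',
    "",
    '{"chosen": "Human: What is 2 + 2? Assistant: 4.",'
    ' "rejected": "Human: What is 2 + 2? Assistant: 5."}',
]

if __name__ == "__main__":
    for pair in parse_pairs(SAMPLE):
        print(pair.chosen, "|", pair.rejected)
    # The scalar rewards here are placeholders; a real reward model would produce them.
    print(pairwise_loss(1.5, 0.2))
```

The loss is small when the reward model scores the chosen response well above the rejected one, and grows as that ordering is reversed, which is what pushes a reward model toward the annotators' preferences.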