PKU-Alignment/safe-rlhf
An RLHF framework for training value-aligned LLMs with safety constraints, developed by Peking University's alignment team.

Beaver, the framework in this repository, is a modular RLHF framework that supports supervised fine-tuning (SFT), RLHF, and Safe RLHF training for popular pre-trained models including LLaMA, OPT, and Baichuan. It provides a human-labeled preference dataset (up to 1M pairs) annotated for both helpfulness and harmlessness, pre-trained reward and cost model checkpoints, and multi-scale metrics for verifying safety constraints. The project implements the Safe RLHF method for constrained value alignment; the paper describing the method was accepted at ICLR 2024.
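
The constrained value alignment mentioned above can be illustrated with a small Lagrangian sketch: a reward signal is maximized while a cost signal is kept under a threshold, and a multiplier grows whenever the measured cost exceeds that threshold. This is a minimal sketch under stated assumptions, not the repository's code: the function names, the learning rate, the zero threshold, and the `(1 + lam)` normalization are all hypothetical choices made for illustration.

```python
def update_multiplier(lam: float, mean_cost: float,
                      threshold: float = 0.0, lr: float = 0.1) -> float:
    """Projected gradient ascent on the Lagrange multiplier (illustrative).

    The multiplier rises when the mean cost exceeds the threshold and
    falls otherwise; it is clipped at zero so it stays non-negative.
    """
    return max(0.0, lam + lr * (mean_cost - threshold))


def combined_advantage(reward_adv: float, cost_adv: float, lam: float) -> float:
    """Blend reward and cost advantages into one policy-gradient signal.

    Dividing by (1 + lam) is an assumed normalization that keeps the
    combined signal on a bounded scale as the multiplier grows.
    """
    return (reward_adv - lam * cost_adv) / (1.0 + lam)


if __name__ == "__main__":
    lam = 1.0
    # The batch was too costly on average, so the penalty weight increases.
    lam = update_multiplier(lam, mean_cost=0.5)
    print(lam, combined_advantage(2.0, 1.0, lam))
```

With `lam = 0` the combined signal reduces to the plain reward advantage, which corresponds to ordinary RLHF without a safety constraint.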