
PKU-Alignment/safe-rlhf

An RLHF framework for training value-aligned LLMs with safety constraints, developed by Peking University's alignment team.

Velocity (7d): +1.4 ★ / day · Trend: steady

Beaver is a modular RLHF framework that supports supervised fine-tuning (SFT), RLHF, and Safe RLHF training for popular pre-trained models, including LLaMA, OPT, and Baichuan. It provides human-labeled preference datasets (up to 1M pairs) with both helpfulness and harmlessness labels, pre-trained reward and cost model checkpoints, and multi-scale safety evaluation metrics. The project implements Safe RLHF, a method for constrained value alignment whose paper was accepted at ICLR 2024.
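
To give a rough sense of what "constrained value alignment" with separate reward and cost models can look like, here is a minimal sketch of a generic Lagrangian relaxation: maximize expected reward while keeping expected cost at or below zero. This is an illustration under stated assumptions, not code from the repository. The function names, the zero cost threshold, and the learning rate are all hypothetical.

```python
def lagrangian_objective(mean_reward: float, mean_cost: float, lam: float) -> float:
    """Scalar the policy would try to maximize: reward minus lambda-weighted cost.

    Assumption: the constraint is mean_cost <= 0, so a positive mean_cost
    counts as a violation and is penalized in proportion to lam.
    """
    return mean_reward - lam * mean_cost


def update_multiplier(lam: float, mean_cost: float, lr: float = 0.1) -> float:
    """One dual-ascent step on the multiplier (hypothetical helper).

    lam grows while the constraint is violated (mean_cost > 0), shrinks
    when it is satisfied, and is clamped so it never goes below zero.
    """
    return max(0.0, lam + lr * mean_cost)


if __name__ == "__main__":
    lam = 1.0
    # Toy batch statistics standing in for reward-model and cost-model scores.
    for mean_reward, mean_cost in [(2.0, 0.5), (1.8, 0.2), (1.7, -0.3)]:
        objective = lagrangian_objective(mean_reward, mean_cost, lam)
        lam = update_multiplier(lam, mean_cost)
        print(f"objective={objective:.3f} lambda={lam:.3f}")
```

The multiplier acts as an adaptive penalty weight: it rises while sampled responses are judged costly on average and decays once they are not, so the trade-off between helpfulness and harmlessness is adjusted during training rather than fixed in advance.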
