Joyce94/LLM-RLHF-Tuning
A complete RLHF training framework implementing SFT, reward modeling, PPO, and DPO for fine-tuning language models with LoRA/PEFT.

This repository provides a from-scratch implementation of the three-stage RLHF (Reinforcement Learning from Human Feedback) training pipeline for large language models: supervised fine-tuning (SFT), reward model (RM) training, and PPO (Proximal Policy Optimization) training. It also supports DPO (Direct Preference Optimization) training, which optimizes the policy directly on preference pairs without a separate reward model. The framework uses PEFT and LoRA for parameter-efficient fine-tuning, supports LLaMA, LLaMA2, and Alpaca models, and runs distributed training via accelerate and DeepSpeed.
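
To make the preference-based objectives concrete, here is a minimal sketch of the standard pairwise reward-model loss and the standard DPO loss, written in plain Python on scalar values. It is an illustration only and not this repository's code: the function names (`log_sigmoid`, `pairwise_rm_loss`, `dpo_loss`), the argument names, and the default `beta=0.1` are all assumptions chosen for the example. A real training loop would compute these on batched tensors of summed token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_rm_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise ranking loss for reward modeling (hypothetical helper).

    Pushes the scalar reward of the preferred response above the
    reward of the rejected response: -log sigmoid(r_chosen - r_rejected).
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference
) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy favors the chosen response more than the reference does:
    # the loss drops below log(2).
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

When the policy and the reference assign the same log-probabilities, the DPO loss is exactly log 2 (about 0.693). It decreases as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.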