uclaml/SPPO
Self-Play Preference Optimization (SPPO) is a self-play framework for language model alignment with a new learning objective, released together with trained model weights.

The repository provides the official implementation of SPPO, a self-play method for aligning language models with a learning objective derived from game theory. It includes training scripts for fine-tuning LLMs, evaluation pipelines for benchmarks such as AlpacaEval 2.0 and the Open LLM Leaderboard, and released model checkpoints. The approach frames alignment as a two-player game in which the model improves by playing against its own previous iterate.
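To make the self-play objective more concrete, here is a minimal sketch, assuming the squared-error form described in the SPPO paper: the policy's log-probability ratio against the previous iterate is regressed toward a scaled, centered win probability. The function names, argument names, and the sample format are hypothetical and are not taken from the repository's code. The win probabilities are assumed to come from a separate preference model.

```python
def sppo_term(logp_policy, logp_ref, win_prob, eta):
    """Squared-error term for one response (illustrative, not the repo's API).

    logp_policy: log-probability of the response under the policy being trained
    logp_ref:    log-probability under the previous iterate (the "opponent")
    win_prob:    estimated probability that the response beats the previous iterate
    eta:         scaling hyperparameter (assumed name)
    """
    log_ratio = logp_policy - logp_ref
    target = eta * (win_prob - 0.5)  # centered so that a 50% win rate means no change
    return (log_ratio - target) ** 2


def sppo_batch_loss(samples, eta):
    """Mean of the per-response terms over (logp_policy, logp_ref, win_prob) tuples."""
    return sum(sppo_term(lp, lr, p, eta) for lp, lr, p in samples) / len(samples)
```

Under this sketch, a response that beats the previous iterate more than half the time has a positive target, so minimizing the loss raises its probability relative to the previous iterate. A response that loses more often is pushed down.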