
uclaml/SPPO

Self-Play Preference Optimization is a self-play framework for language model alignment with a new learning objective, released with trained model weights.

586 stars · Python · Language Models · ML Frameworks
Velocity (7d): +0.8 ★/day · Trend: steady

The repository provides the official implementation of SPPO, a self-play method for aligning language models that uses a novel learning objective derived from game theory. It includes training scripts for fine-tuning LLMs, evaluation pipelines for benchmarks such as AlpacaEval 2.0 and the Open LLM Leaderboard, and released model checkpoints. The approach frames alignment as a competitive two-player game in which the model improves by playing against itself.
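To make the self-play idea concrete, here is a minimal sketch of one possible per-batch objective. It assumes a squared-error form in which the log-probability ratio between the updated policy and the previous iterate is pushed toward a scaled, centered win probability. That form, the function name `sppo_style_loss`, and the parameter `eta` are illustrative assumptions, not code or definitions taken from this page or from the repository.

```python
from typing import Sequence


def sppo_style_loss(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    win_prob: Sequence[float],
    eta: float = 1.0,
) -> float:
    """Mean squared gap between the log-ratio and a scaled win-rate target.

    ASSUMPTION: this squared-error form is an illustrative reading of a
    self-play preference objective, not the repository's implementation.

    logp_new: log-probability of each response under the policy being trained
    logp_old: log-probability of the same response under the previous iterate
    win_prob: estimated probability that the response beats the previous
              iterate's own responses (values in [0, 1])
    eta:      hypothetical scaling factor for the target
    """
    if not (len(logp_new) == len(logp_old) == len(win_prob)):
        raise ValueError("all inputs must have the same length")
    if len(logp_new) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for lp_new, lp_old, p in zip(logp_new, logp_old, win_prob):
        log_ratio = lp_new - lp_old   # how far the policy moved on this response
        target = eta * (p - 0.5)      # positive for winners, negative for losers
        total += (log_ratio - target) ** 2
    return total / len(logp_new)
```

Under this sketch, a response that wins more than half the time has its probability raised relative to the previous iterate, and one that loses more often has it lowered. In a self-play loop, the trained policy would then become the opponent for the next round.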
