tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models, developed at Stanford. Its length-controlled win rates have a 0.98 Spearman correlation with ChatBot Arena.

AlpacaEval is an automated benchmarking system for evaluating chat LLMs. It uses GPT-4 as an annotator to compare each model's outputs against those of a baseline model, then computes win rates to rank models. To discourage gaming the metric with longer outputs, it reports length-controlled win rates, which achieve a 0.98 Spearman correlation with ChatBot Arena. A run costs under $10 and completes in under 5 minutes, making AlpacaEval a fast, cheap alternative to human preference evaluation for LLM leaderboards.
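
To make the win-rate idea concrete, here is a minimal sketch of how pairwise annotator preferences could be turned into a raw win rate and a ranking. It is not AlpacaEval's implementation: the function names `win_rate` and `rank_models` and the `"model"` / `"baseline"` / `"tie"` labels are assumptions made for illustration, and the sketch computes only an uncorrected win rate, not the length-controlled variant.

```python
from collections import Counter

# Hypothetical label set for one pairwise comparison (an assumption of this
# sketch, not AlpacaEval's own encoding).
VALID_LABELS = {"model", "baseline", "tie"}


def win_rate(preferences):
    """Return the model's win rate in percent; a tie counts as half a win."""
    if not preferences:
        raise ValueError("need at least one comparison")
    counts = Counter(preferences)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown preference labels: {sorted(unknown)}")
    wins = counts["model"] + 0.5 * counts["tie"]
    return 100.0 * wins / len(preferences)


def rank_models(results):
    """Sort models by win rate, highest first.

    `results` maps a model name to its list of preference labels.
    """
    scored = [(name, win_rate(prefs)) for name, prefs in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    example = {
        "model_a": ["model", "baseline", "tie", "model"],
        "model_b": ["baseline", "baseline", "model", "tie"],
    }
    for name, rate in rank_models(example):
        print(f"{name}: {rate:.1f}%")
```

Counting a tie as half a win keeps the score symmetric: a model that is indistinguishable from the baseline lands at 50%. A length-controlled win rate would additionally adjust for differences in output length between the model and the baseline, which this sketch does not attempt.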