tatsu-lab/alpaca_eval

An automatic evaluator for instruction-following language models, developed at Stanford. Its length-controlled win rates have a 0.98 correlation with Chatbot Arena.

2k stars · Jupyter Notebook · LLMOps · Eval · Language Models
Velocity (7d): +1.8 ★ / day · Trend: steady

AlpacaEval is an automated benchmarking system for evaluating chat LLMs. It uses GPT-4 as an annotator to compare each model output against a baseline output, then computes the model's win rate over the baseline to rank models. To keep models from gaming the metric with longer answers, it reports length-controlled win rates, which achieve a 0.98 Spearman correlation with Chatbot Arena. A run costs under $10 and completes in under 5 minutes, making it a fast, cheap alternative to human preference evaluation for LLM leaderboards.
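As a rough illustration of the win-rate idea described above, the sketch below computes a raw win rate from a list of per-example annotator preferences. It is not AlpacaEval's API: the function name `win_rate` and the encoding of preferences (1.0 when the model's output is preferred, 0.0 when the baseline's is, 0.5 for a tie) are assumptions made for this example. The length-controlled variant additionally adjusts for output length and is not shown here.

```python
from typing import Sequence


def win_rate(preferences: Sequence[float]) -> float:
    """Return the percentage of comparisons won by the model.

    Hypothetical encoding, assumed for this sketch:
    1.0 = annotator preferred the model, 0.0 = preferred the baseline,
    0.5 = tie.
    """
    if not preferences:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(preferences) / len(preferences)


# Five pairwise comparisons: three wins, one loss, one tie.
prefs = [1.0, 0.0, 1.0, 1.0, 0.5]
print(win_rate(prefs))  # 70.0
```

Ranking models then amounts to sorting them by this percentage, with every model compared against the same baseline.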
